Ingredients taxonomy
== Introduction ==
== Why? ==
Why do we need an ingredients ontology? The ontology describes how ingredients are derived from each other and how ingredients can be combined into new ingredients. The basic thesaurus can help to:
* '''Normalise ingredients''' - Producers take a lot of freedom in describing the ingredients they use. They use different words (synonyms) to designate the same ingredient. An ontology helps to standardise the ingredients;
* '''Exclusive search''' - the taxonomy can be used to support search for ingredients in any language and under any of their synonyms;
* '''Translation''' - as the taxonomy contains translations of each ingredient, it can be used as a means to translate ingredient lists;
* '''Ingredients analysis''' - the taxonomy can be extended with properties to indicate if ingredients are suitable for vegans, vegetarians etc. and to estimate how processed the food product is (e.g. NOVA groups);
* '''Ingredient language inconsistencies''' - the ingredient lists of a single product sometimes differ between languages, for example because ingredients are left out or simplified. The thesaurus might help to reveal these.
If the thesaurus is extended by relations between ingredients, other benefits arise, depending on what is defined in the taxonomy. The basic relation is the isa-relation or is-a-kind-of relation. The isa-relation links two ingredients, one of which is a more specific kind of the other. For instance the ingredient ''strawberry'' can also be found as ''strawberry puree'' and ''strawberry juice''. ''Strawberry'' is more generic than ''strawberry puree'', which is more specific. The rule here is: would it make sense to replace the children by the parent in the ingredients list? For instance ''strawberry puree'' and ''strawberry juice'' are still ''strawberry'', but in a different form. So if you replace ''strawberry puree'' with ''strawberry'' in the ingredient list, it is still a valid ingredient list. Here ''strawberry'' is the parent and ''strawberry puree'' a child.
This isa-relation makes the following functionalities possible:
* '''Inclusive source search''' - The '''inclusive search''' means that a search for the parent ''strawberry'' will also show the children ''strawberry puree'', etc. The '''source search''' refers to the origin of the ingredient, i.e. strawberry, apple, cinnamon, soy, etc. (I want another word for source).
* '''Inclusive condition search''' - in a similar way we could search for the condition of a source ingredient, for instance pureed, juiced, dried, reconditioned, etc. So if we would like to search for all pureed fruits, the relation between [https://world.openfoodfacts.org/ingredient/fruit-puree fruit puree] and [https://world.openfoodfacts.org/ingredient/strawberry-puree strawberry puree] should be defined in the taxonomy.
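The inclusive search described above can be sketched in a few lines of Python. This is only an illustration and not the OFF implementation: the <code>PARENTS</code> mapping and both function names are hypothetical, and the entries are limited to the strawberry examples used in this section.

```python
# Hypothetical child -> parents mapping that encodes the isa-relation.
PARENTS = {
    "strawberry puree": ["strawberry", "fruit puree"],
    "strawberry juice": ["strawberry"],
    "strawberry": [],
    "fruit puree": [],
}


def ancestors(ingredient):
    """Return every ingredient reachable by following isa-relations upwards."""
    found = set()
    todo = list(PARENTS.get(ingredient, []))
    while todo:
        parent = todo.pop()
        if parent not in found:
            found.add(parent)
            todo.extend(PARENTS.get(parent, []))
    return found


def inclusive_search(query):
    """Return the query itself plus all ingredients that are a kind of it."""
    return sorted(
        name for name in PARENTS if name == query or query in ancestors(name)
    )


print(inclusive_search("strawberry"))
print(inclusive_search("fruit puree"))
```

Because ''strawberry puree'' has two parents in this sketch, it is returned both by the source search (''strawberry'') and by the condition search (''fruit puree'').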
One could also define other relations in order to support:
* '''Hidden ingredients''' - an ingredient might contain hidden ingredients, which the ontology might reveal. For example [https://world.openfoodfacts.org/ingredient/butter butter] contains [https://world.openfoodfacts.org/ingredient/butterfat butterfat].
* '''Combined ingredients''' - an ingredient might appear as a single ingredient. In reality however
* '''Processed ingredients''' - often an ingredient is derived from another ingredient through some process. We can make explicit what these processes are. For example, [https://world.openfoodfacts.org/ingredient/clarified-butter clarified butter] is created from [https://world.openfoodfacts.org/ingredient/butter butter] by separating the [https://world.openfoodfacts.org/ingredient/milk-solids milk solids] and [https://world.openfoodfacts.org/ingredient/water water] from the [https://world.openfoodfacts.org/ingredient/butterfat butterfat].
* '''Ingredient incompleteness''' - often an ingredient is incompletely defined in an ingredient list. For instance if an ingredient list specifies [https://world.openfoodfacts.org/ingredient/milk milk], it should be defined which mammal the milk comes from, for instance [https://world.openfoodfacts.org/ingredient/cow-s-milk cow's milk].
== Taxonomy ==
The current OFF ingredients taxonomy ([https://github.com/openfoodfacts/openfoodfacts-server/blob/master/taxonomies/ingredients.txt github]) can be seen as a thesaurus (although more information is slowly being added).
''In the context of information retrieval, a thesaurus (plural: "thesauri") is a form of controlled vocabulary that seeks to dictate semantic manifestations of metadata in the indexing of content objects. A thesaurus serves to minimise semantic ambiguity by ensuring uniformity and consistency in the storage and retrieval of the manifestations of content objects. ANSI/NISO Z39.19-2005 defines a content object as "any item that is to be described for inclusion in an information retrieval system, website, or other source of information". The thesaurus aids the assignment of preferred terms to convey semantic metadata associated with the content object''. ([https://en.wikipedia.org/wiki/Thesaurus_(information_retrieval) wikipedia])
The purpose of the thesaurus is to have a list of the ingredients that occur in ingredient lists, each unique and well defined. The OFF ingredients taxonomy is mainly a list of elements that occur in ingredient lists. However, the OFF ingredients taxonomy is more than a simple thesaurus.
=== Example ===
Below is the ingredient class entry of '''soybean''' in the OFF ingredients taxonomy:
<nowiki>#</nowiki>description:en: SOYBEAN (Glycine max), or soya bean is a species of legume native to East Asia.
<br /><nowiki>#</nowiki>comment:en: There is a bit of confusion whether "bean" should be part of the name, so "soy" or "soybeen"?.
<br /><en:soya
<br />en:soya bean, soy beans, soya beans, soybean
<br />da:Sojabønne
<br />de:Sojabohne, Sojabohnen, Sojakerne
<br />es:granos de soja, habas de soja
<br />et:Sojauba
<br />fi:Soijapapu
<br />fr:fève de soja, fèves de soja, soja entier, graine de soja, graines de soja, graines soja
=== Explanation ===
* '''Comment-line''' - a line starting with ''#'' is a comment line and can be used to add other information to an ingredient. This information is not used by OFF.
* '''Description''' - the ''#description:en:'' line provides a definition of the ingredient in English. In this case the definition is taken from Wikipedia.
* '''Comment''' - the ''#comment:en:'' line provides a comment in English about the entry.
* '''Super-ingredient''' - the ''<en:soya'' line names the super-ingredient of this (sub-)ingredient. The relationship between the super-ingredient and the sub-ingredient is an ''is-a-kind-of-relationship''. This means that the sub-ingredient is a kind of super-ingredient. In this example the soybean ingredient is a kind of soy ingredient. The sub-ingredient provides more detail about the ingredient. In this case the super-ingredient is very broad (something with soy), whereas the sub-ingredient adds the detail of the bean. This line is optional and is left out if no super-ingredient can be specified.
* '''Key''' - The ''en:soya bean'' line is the main name (key) of the ingredient. This doubles as the name for the ingredient (in this case in English (''en:'')).
* '''Translations''' - The next lines provide translations of the main ingredient in other languages, one language per line. Each line starts with a language prefix. Thus ''de:'' means German.
* '''Synonyms''' - an ingredient in a language can have multiple synonyms. These synonyms appear as secondary entries on a language line. The first entry is, however, the main name in that language.
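The line types explained above can be read mechanically. The following Python sketch parses a single entry into its properties, super-ingredients and names per language. It is a simplified illustration and not the parser used by OFF: the function name, the returned dictionary layout and the shortened sample entry are assumptions made for this example.

```python
def parse_entry(text):
    """Parse one taxonomy entry into properties, parents and per-language names."""
    entry = {"properties": {}, "parents": [], "names": {}}
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # "#description:en: some text" -> key "description:en"
            parts = line.lstrip("#").strip().split(":", 2)
            if len(parts) == 3:
                entry["properties"][f"{parts[0]}:{parts[1]}"] = parts[2].strip()
            continue
        if line.startswith("<"):
            # "<en:soya" names a super-ingredient
            entry["parents"].append(line[1:].strip())
            continue
        # "de:Sojabohne, Sojabohnen" -> language prefix and its synonyms
        lang, _, names = line.partition(":")
        entry["names"][lang] = [name.strip() for name in names.split(",")]
    return entry


# Shortened copy of the soybean entry shown in the example above.
SOYBEAN = """
#description:en: SOYBEAN (Glycine max), or soya bean is a species of legume native to East Asia.
<en:soya
en:soya bean, soy beans, soya beans, soybean
de:Sojabohne, Sojabohnen, Sojakerne
"""

entry = parse_entry(SOYBEAN)
print(entry["parents"])
print(entry["names"]["en"][0])
```

In this sketch the first name on a language line is kept at index 0, which mirrors the rule that the first entry is the main name in that language.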
=== Multiple super ingredients ===
An ingredient can have multiple super ingredients. This can be used to express the relation between ingredients. These super ingredients reflect the description of an ingredient, such as ''grapefruit juice''. The ingredient ''grapefruit juice'' has two components: the noun ''juice'' and the adjective ''grapefruit''. The noun can be seen as the processing state (purée, juice, flour, etc.) and the adjective as the subject of that processing. For a user both might be of interest: what kind of juice is it, and how was the grapefruit processed?
As the same super-ingredient will also appear in combination with other ingredients, this might help the user understand what other fruits there are, how they are processed and what other fruit juices exist.
At the moment (2018-10-05) the taxonomy is not consistently structured. It sometimes uses the noun and sometimes the adjective as the structuring principle.
These super ingredients and ingredients can also be found in ingredient lists, where they are indicated by parentheses or colons. For example: ''Tomates (Dés, Jus, Purée)'', ''Yaourt (lait, crème, sucre, ferments lactiques)'', ''crème fraîche (dont lait)'' or ''Jus de : pomme, orange, passion, ananas, citron''. These are four different approaches to super ingredients and define different relations between ingredients.
From a formal semantic viewpoint: juice is the superclass of grapefruit juice (''grapefruit juice'' ''is-a-kind-of'' ''juice''). In an "is-a-kind-of" relation one could replace the specific formulation ''grapefruit juice'' with the generic formulation ''juice'', and still have a valid sentence.
In the taxonomy multiple super ingredients can be defined as:
<br />'''<en:soya'''
<br />'''<en:sauce'''
<br />'''en:soy sauce'''
<br />'''de:Sojasauce'''
<br />
Note that ''en:soya'' and ''en:sauce'' should exist as ingredients in the taxonomy. (Is there a consistency check done somewhere?)
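Whether such a consistency check exists in OFF is left open above. As an illustration only, a check of this kind could look like the Python sketch below. The function name and the <code>ENTRIES</code> mapping (entry key to list of super ingredients) are hypothetical and say nothing about the actual OFF tooling.

```python
def missing_parents(entries):
    """Map each entry key to the super ingredients that have no entry of their own."""
    known = set(entries)
    problems = {}
    for key, parents in entries.items():
        missing = [parent for parent in parents if parent not in known]
        if missing:
            problems[key] = missing
    return problems


# Hypothetical taxonomy fragment in which en:sauce has not been defined.
ENTRIES = {
    "en:soya": [],
    "en:soy sauce": ["en:soya", "en:sauce"],
}

print(missing_parents(ENTRIES))
```

An empty result would mean that every super ingredient referenced in the fragment is itself defined.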
==== • Adjectives ====
The ingredient can be made more specific through adjectives. As these adjectives occur for many ingredients and are essentially the same, they have been separated into separate taxonomies. Once they are in such a taxonomy, they no longer have to be listed in the ingredients taxonomy. The adjectives can be categorised as:
* '''Processing''' adjectives describe what has been done with an ingredient, e.g. ''ground'', ''cooked'', etc. Either the processing itself is used as adjective or the result, for instance ''ground'' versus ''powdered''. The identified processing adjectives can be found in the [[Processing Taxonomy]].
* '''Labels''' adjectives that follow the labels taxonomy, such as ''organic'', etc.
==== • Origin list specification ====
Producers sometimes try to shorten ingredient lists by suppressing repetition. For instance the entry ''vegetable fats (palm, sunflower)'' should actually be read as two elements: ''palm vegetable fat'' and ''sunflower vegetable fat''. In this case the parentheses act as a method to avoid repetition.
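The expansion of such a shortened entry can be sketched as follows. This is a hypothetical helper, not OFF code, and it assumes that the caller already knows the parentheses list origins: the syntax alone does not distinguish this case from the other uses of parentheses described in this section. It also keeps the head exactly as written, so the plural ''fats'' is not turned into a singular.

```python
import re

# "head (item, item, ...)" with a single, non-nested pair of parentheses.
PATTERN = re.compile(r"^\s*(?P<head>[^()]+?)\s*\((?P<items>[^()]*)\)\s*$")


def expand_origin_list(text):
    """Expand 'head (a, b)' into ['a head', 'b head']; other text is returned as is."""
    match = PATTERN.match(text)
    if match is None:
        return [text.strip()]
    head = match.group("head")
    items = [item.strip() for item in match.group("items").split(",") if item.strip()]
    return [f"{item} {head}" for item in items]


print(expand_origin_list("vegetable fats (palm, sunflower)"))
```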
==== • Processed list specification ====
The example ''Tomates (Dés, Jus, Purée)'' is also used to shorten an ingredient list. In this case the parentheses contain the processing that was applied.
==== • Compound ingredients ====
It is possible to have an ingredient that in reality consists of multiple other ingredients. For instance ''fr:jus de soja'' can also be specified as ''en:water'' plus ''en:soybean''. On ingredient lists this usually appears as ''jus de soja (eau, fèves de soja)'', i.e. the elements between parentheses define the real ingredients. The compound ingredient can also be seen as a (sub-)ingredient list. This is currently NOT encoded as super ingredient.
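A compound ingredient can be represented as a name plus its own (sub-)ingredient list. The sketch below shows one possible way to split the printed form into these two parts. The function name is hypothetical, nesting is not handled, and nothing here reflects how OFF stores compound ingredients.

```python
def split_compound(text):
    """Split 'name (a, b)' into the compound name and its listed sub-ingredients."""
    name, opening, rest = text.partition("(")
    if not opening:
        # No parentheses: a plain ingredient without a sub-list.
        return name.strip(), []
    inner = rest.rsplit(")", 1)[0]
    parts = [part.strip() for part in inner.split(",") if part.strip()]
    return name.strip(), parts


print(split_compound("jus de soja (eau, fèves de soja)"))
```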
==== • Made of specification ====
The example ''crème fraîche (dont lait)'' designates that ''crème fraîche'' is produced from ''lait''. This is usually used to indicate that an ingredient contains an allergenic substance. This is currently NOT encoded as super ingredient.
==== Exceptions / Reality ====
As the taxonomy is strongly tied to the URL search function, some exceptions have been implemented:
* '''en:xxxxx flavour''' - has only one parent: ''<en:flavour''. The parent ''<en:xxxxx'' is not set up, as it would mix the real ingredient ''en:xxxxx'' with the artificial flavour ingredient.
=== Language prefix ===
The language prefixes are based on the approach taken by [https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Languages Wikipedia]. In practice this implies that the language prefixes as used by Wikidata are used, which includes both the language and the associated script (if applicable). Sometimes the language prefix from the Wikipedia page is used. For two-letter prefixes the same prefix is used in Wikipedia and Wikidata. For prefixes of three or more letters this is no longer true ([[language acronyms]]).
=== Application ===
==== Inclusive search ====
The only application for the taxonomy at the moment is search by URL. You could enter the URL [https://world.openfoodfacts.org/ingredient/strawberry/ strawberry] to find all products that contain ''strawberry''. Or [https://world.openfoodfacts.org/ingredient/strawberry/language/finnish strawberry] to find all products that have a Finnish ingredient list with ''strawberry'' (puutarhamansikka). This is an inclusive search, which implies that all ingredients that are more specific are included in the results as well.
What the search supports is determined by the parents that are defined in the taxonomy. So ''strawberry flavour'' is not found when searching for ''strawberry''.
=== Practice ===
Some experience has been gathered in building the OFF taxonomy. It can be divided into guiding principles and steps.
==== Guiding principles ====
* '''Ingredient list guides''' - the entries as found on ingredient lists are the basis and guide the taxonomy. So no entries from other places (an exception is translation bootstrapping);
* '''Primary language''' - the first language of an ingredient should be English, if it is available in that language. Otherwise use the language where the ingredient occurs the most (at the moment that will probably be French);
* '''Singular''' - ingredients should be written in singular, i.e. ''apricot'' and not ''apricots'';
* '''Lower case''' - ingredients should be written in lower case;
* '''No assumption''' - if in doubt about a translation or an assignment or whatever, keep the ingredient separate. Make no assumption about an ingredient. If it is specified vaguely, keep it vague;
* '''Or's''' - sometimes an ingredient is listed as ''X or Y''. This will be entered with <en:X and <en:Y, so there are two superclasses.
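The ''Lower case'' and ''Or's'' principles can be combined in a small helper that drafts an entry for an ''X or Y'' ingredient. This is a hypothetical sketch for illustration: the function name and the sample ingredient are assumptions, and the singular form is not derived automatically.

```python
def or_entry(name, lang="en"):
    """Draft a taxonomy entry for an 'X or Y' ingredient with X and Y as parents."""
    # Lower-case the name and collapse repeated whitespace.
    cleaned = " ".join(name.lower().split())
    alternatives = [part.strip() for part in cleaned.split(" or ")]
    lines = [f"<{lang}:{alternative}" for alternative in alternatives]
    lines.append(f"{lang}:{cleaned}")
    return "\n".join(lines)


print(or_entry("Colza oil or Sunflower oil"))
```

The draft still has to be reviewed by hand, for instance to confirm that both super ingredients already exist in the taxonomy.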
==== Steps ====
To build the taxonomy from raw data, the following steps need to be taken:
* '''Gather''' - the first step is gathering the ingredients from the ingredient lists. This results in a list of ''raw ingredients''. It is possible to create a list with ingredients that are not yet in the taxonomy. (This process needs to be described @stephane does this);
* '''Assign''' - the raw ingredients in a language should be mapped onto the existing taxonomy. For each raw ingredient one has to check whether it is already in the taxonomy. If the raw ingredient is not yet in the taxonomy one has to check whether it can be assigned to an existing ingredient, either as synonym or as new translation. The raw ingredient might be totally new or more specific than an existing ingredient (the ''new raw ingredients'');
* '''Merge''' - for each new raw ingredient it must be decided how it will fit in the existing taxonomy. Either it is totally new or it is a child of an existing ingredient;
* '''Key''' - if a new raw ingredient can be mapped to the taxonomy, it will probably be a new synonym of an existing entry. Ideally one should check which of the synonyms occurs the most and set that one as main ingredient name;
* '''Create''' - if the ingredient is totally new, a new entry can be created. It should be decided what the super ingredient is (if any). It can then be entered in that part of the taxonomy;
* '''Translate''' - try to find the translations of the ingredient, for instance by searching for translations in OFF: [https://world.openfoodfacts.org/ingredient/fr:polydextrose/languages fr:polydextrose] shows the languages available, and the translations offered there can be used;
* '''Define''' - add a definition from [http://wikipedia.org wikipedia] and [http://wikidata.org wikidata] if available;
* '''OFF''' - add a link to the corresponding ingredients page. For instance for [https://world.openfoodfacts.org/ingredient/fr:polydextrose/ fr:polydextrose];
* '''Occurrences''' - note how often the ingredient occurs, in how many languages and at what date that was determined. This might require an advanced search ([https://world.openfoodfacts.org/cgi/search.pl?action=process&tagtype_0=ingredients&tag_contains_0=contains&tag_0=polydextrose&sort_by=unique_scans_n&page_size=20&axis_x=energy&axis_y=products_n&action=display polydextrose]). This might allow us to track if changes are necessary.
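The ''Assign'' step above amounts to a lookup of each raw ingredient in the names and synonyms that are already known. A minimal Python sketch, assuming a hypothetical in-memory <code>TAXONOMY</code> mapping from entry key to all names of that entry in one language, could look like this. It only performs exact, case-insensitive matching; deciding whether a new raw ingredient is a synonym, a translation or a child remains manual work.

```python
def assign(raw_ingredients, taxonomy):
    """Split raw ingredients into already known ones and new raw ingredients.

    `taxonomy` maps an entry key to the list of all its names and synonyms.
    """
    # Reverse index: lower-cased name or synonym -> entry key.
    index = {}
    for key, names in taxonomy.items():
        for name in names:
            index[name.lower()] = key

    known, new = {}, []
    for raw in raw_ingredients:
        key = index.get(raw.strip().lower())
        if key is None:
            new.append(raw)
        else:
            known[raw] = key
    return known, new


# Hypothetical fragment based on the soybean entry shown earlier.
TAXONOMY = {"en:soya bean": ["soya bean", "soy beans", "soya beans", "soybean"]}

known, new = assign(["Soybean", "polydextrose"], TAXONOMY)
print(known)
print(new)
```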
==== Watch out ====
It is possible to make mistakes in these steps.
* '''Wrong translation''' - the ingredient lists available on a product in different languages are not always translations of each other;
* '''Translation bootstrapping''' - in order to get up to speed with the translations, we use Wikipedia and Wikidata to find translations. This can result in wrong translations, as the term found there can be something different from what is found on the package. So be careful with Wikipedia disambiguation strings, Latin species names, etc. In the long run these should be superseded by what is found on the products;
* '''Contaminated ingredients''' - a lot of products contain text that is not an actual ingredient. Either this is caused by text recognition, which picks up other elements as well, or people add text that is not an ingredient.
== Setup ==
Adding a new language for ingredients to OFF starts with the basics: adding a lot of translations. Without this, your language will not be recognised, handled and analysed. There are two ways to add these translations.
=== Adding translations ===
The basic translation work can be done by anyone with a web browser. You can use the available web interface for this. For example to translate the Dutch ingredients, use this link: [https://world-nl.openfoodfacts.org/ingredients?translate=add]. In this URL ''world'' stands for all products in the database and ''nl'' stands for those with Dutch as a main language. Then go through the provided list of ingredients. Be careful, as you might easily introduce wrong translations, which are not actually used on ingredient lists.
=== Editing translations ===
If you have experience with GitHub, you can also edit the necessary files on your desktop. The ingredients are split over multiple files, which allows for specific handling:
* [https://github.com/openfoodfacts/openfoodfacts-server/blob/master/taxonomies/ingredients.txt Ingredients taxonomy] - this file contains all the normal ingredients;
* [https://github.com/openfoodfacts/openfoodfacts-server/blob/master/taxonomies/additives.txt Additives] - the file with all the additives;
* [https://github.com/openfoodfacts/openfoodfacts-server/blob/master/taxonomies/allergens.txt Allergens] - the allergens required by legislation;
* [https://github.com/openfoodfacts/openfoodfacts-server/blob/master/taxonomies/amino_acids.txt Amino-acids] - the amino acids, required for infant products;
* [https://github.com/openfoodfacts/openfoodfacts-server/blob/master/taxonomies/minerals.txt Minerals] - the minerals required by legislation;
* [https://github.com/openfoodfacts/openfoodfacts-server/blob/master/taxonomies/nucleotides.txt Nucleotides] - the nucleotides required by legislation;
* [https://github.com/openfoodfacts/openfoodfacts-server/blob/master/taxonomies/other_nutritional_substances.txt Other nutritional substances] - also legislation;
* [https://github.com/openfoodfacts/openfoodfacts-server/blob/master/taxonomies/vitamins.txt Vitamins] - legislation issue;
You will notice that many ingredients are combined with an adjective either in front of or after the ingredient. As many adjectives occur often, it is not necessary to add all variants to the ingredients taxonomy. This keeps the taxonomy smaller and makes processing faster. Several different types of adjectives are identified:
* [https://github.com/openfoodfacts/openfoodfacts-server/blob/master/taxonomies/ingredients_processing.txt Processes] - describe what has been done to an ingredient (roasted, rehydrated, etc.);
* [https://github.com/openfoodfacts/openfoodfacts-server/blob/master/taxonomies/labels.txt Labels] - describe the nature (?) of the ingredient, i.e. organic, fair trade, etc;
* [https://github.com/openfoodfacts/openfoodfacts-server/blob/master/taxonomies/additives_classes.txt Additive classes] - each additive can be assigned to one or more additive classes, which describe the role of the additive in the product;
== Maintenance ==
Once the bulk of the ingredients have been added and translated, the maintenance phase begins. You will quickly see that the system identifies words and phrases as ingredients when it should not. Time for finetuning and maintenance.
=== Detection ===
Finetuning and maintenance start with detection: finding ingredients that are not detected or are falsely detected.
* Unknowns - start by creating a list of unknown ingredients with the URL: https://nl.openfoodfacts.org/ingredients?status=unknown . This will show the unknown ingredients of all products sold in the Netherlands (nl).
You might find unknown ingredients (prefixed with a language code) in other languages if the main language is not the requested language.
=== Finetuning ===
Finetuning allows you to tell the system where the ingredients start and end, which words should not be taken into account, etc. You might regularly return to this task, as the system analyses more and more ingredient lists.
* Adapt file [https://github.com/openfoodfacts/openfoodfacts-server/blob/master/lib/ProductOpener/Ingredients.pm ingredients.pm] - this file contains various elements, which are used to analyse the ingredients of a product (in the order they appear in the file):
** Traces (''%may_contain_regexps'') - is the start of a phrase that introduces traces;
** Allergens (''%contains_regexps'') - appears after an ingredient, to indicate an allergen;
** Abbreviations (''%abbreviations'') - commonly used abbreviations in ingredient lists;
** Or (''%of'') - an optional choice between ingredients, e.g. '''colza or sunflower oil''';
** And (''%and'') - two ingredients, not separated by a delimiter, e.g. '''herbs and spices''';
** And of (''%and_of'') -
** And/or (''%and_or'') -
** The (''%the'') - the [https://en.wikipedia.org/wiki/The grammatical article] of the language;
** Ignore after percent (''%ignore_strings_after_percent'') - a phrase that can be discarded when it appears after a %-sign;
** Ignore phrases (''%ignore_regexps'') - phrases that can be ignored (black list);
** Ingredients start (''%phrases_before_ingredients_list'') - this word marks the beginning of the ingredient list, and helps OFF determine where the recognition should start;
** Ingredients start all caps (''%phrases_before_ingredients_list_uppercase'') - For example in [https://static.openfoodfacts.org/images/products/002/222/420/0820/ingredients_en.4.full.jpg this ingredient list], the list starts with '''INGREDIENTS''';
** After ingredients (''%phrases_after_ingredients_list'') - an ingredient list can end in many different ways. These can be listed here;
** Wrong dashes (''%prefixes_before_dash'') - this allows words separated by a dash to be combined, e.g. '''demi - écrémé''' will be changed to '''demi-écrémé'''.
You might require special handling, which in turn requires specific code. Please contact the OFF'ers on Slack for that (#ingredients).
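To give an idea of what two of these elements do, the sketch below re-creates the ''wrong dashes'' and ''ingredients start'' handling in Python. It is an illustration only: the real logic lives in the Perl file linked above and will differ, and the dictionaries, their contents and the function names are assumptions that merely borrow the hash names listed above.

```python
import re

# Hypothetical per-language data, named after the hashes listed above.
PREFIXES_BEFORE_DASH = {"fr": ["demi"]}
PHRASES_BEFORE_INGREDIENTS_LIST = {"en": ["ingredients"], "fr": ["ingrédients"]}


def fix_dashes(text, lang):
    """Join 'prefix - word' into 'prefix-word' for the known prefixes of a language."""
    prefixes = PREFIXES_BEFORE_DASH.get(lang, [])
    if not prefixes:
        return text
    alternation = "|".join(re.escape(prefix) for prefix in prefixes)
    pattern = rf"\b({alternation})\s+-\s+(\w)"
    return re.sub(pattern, r"\1-\2", text, flags=re.IGNORECASE)


def strip_list_start(text, lang):
    """Drop everything up to and including the phrase that introduces the list."""
    for phrase in PHRASES_BEFORE_INGREDIENTS_LIST.get(lang, []):
        match = re.search(rf"\b{re.escape(phrase)}\s*:?\s*", text, flags=re.IGNORECASE)
        if match:
            return text[match.end():]
    return text


print(fix_dashes("lait demi - écrémé", "fr"))
print(strip_list_start("INGREDIENTS: water, sugar", "en"))
```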
=== Adding ===
* Finding unknown ingredients
* Adding processes
* Adding labels
* Adding black listed ingredients
* Language transposes
== Related pages ==
There is more documentation related to ingredients and the corresponding taxonomy:
* [[Ingredients Events list]]
[[Category:Taxonomies]]
[[Category:Ingredients]]
Latest revision as of 07:17, 23 October 2023
Introduction
Why?
Why do we need an ingredients ontology? The ontology describes how ingredients are derived from each other and how ingredients can be combined into new ingredients. The basic thesaurus can help to:
- Normalise ingredients - Producers take a lot of freedom in describing the ingredients they use. They use different words (synonyms) to designate the same ingredient. An ontology helps to standardise the ingredients;
- Exclusive search - the taxonomy can be used to support search for ingredients in any language and multiple synonyms;
- Translation - as the taxonomy contains translations of each ingredient, it can be used as means to translate ingredient lists;
- Ingredients analysis - the taxonomy can be extended with properties to indicate if ingredients are suitable for vegans, vegetarians etc. and to estimate how processed the food product is (e.g. NOVA groups)
- Ingredient language inconsistencies - it happens that ingredients in different languages of a single product are different. That can be ingredients that are left out or simplified. The thesaurus might help revealing these.
If the thesaurus is extended by relations between ingredients, other benefits arise, depending on what is defined in the taxonomy. The basic relation is the isa-relation or is-a-kind-of relation. The isa-relation defines two ingredients that are related. For instance the ingredient strawberry can also be found as strawberry puree and strawberry juice. Strawberry is more generic than strawberry puree, which is more specific. The rule here is: would it make sense to replace the children by the parent in the ingredients list? For instance strawberry puree and strawberry juice are still strawberry, but under a different shape. So if you replace strawberry puree with strawberry in the ingredient list, it is still a valid ingredient list. Usually strawberry is the parent and strawberry puree a child.
This isa-relation makes the following functionalities possible:
- Inclusive source search - The inclusive search means that a search for the parent strawberry will also show the children strawberry puree, etc. The source search means the the origin of the ingredient, i.e. strawberry , apple, cinnamon, soy, etc. (I want another word for source).
- Inclusive condition search - in a similar way we could search for the condition of a source ingredient, for instance pureed, juiced, dried, reconditioned, etc. So if we would like to search for all pureed fruits, the relation between fruit puree and strawberry puree should be defined in the taxonomy.
One could also define other relations in order to support:
- Hidden ingredients - an ingredient might contain hidden ingredients, the ontology might reveal these. For example butter contains butterfat.
- Combined ingredients - an ingredient might appear as a single ingredient. In reality however
- Processed ingredients - often an ingredient is derived from an other ingredient through some process. We can make explicit what these processes are. Example clarified butter is created from butter by separating the milk solids and water from the butterfat.
- Ingredient incompleteness - often an ingredient is incomplete defined in an ingredient list. For instance if an ingredient-list specifies milk, it should be defined from which mammal the milk comes from, for instance cow's milk.
Taxonomy
The current OFF ingredients taxonomy (github) can be seen as a thesaurus. (although slowly more info is added).
In the context of information retrieval, a thesaurus (plural: "thesauri") is a form of controlled vocabulary that seeks to dictate semantic manifestations of metadata in the indexing of content objects. A thesaurus serves to minimise semantic ambiguity by ensuring uniformity and consistency in the storage and retrieval of the manifestations of content objects. ANSI/NISO Z39.19-2005 defines a content object as "any item that is to be described for inclusion in an information retrieval system, website, or other source of information". The thesaurus aids the assignment of preferred terms to convey semantic metadata associated with the content object. (wikipedia)
The purpose of the thesaurus is to have a list of ingredients, that occur in ingredient lists, that are unique and are well defined. The OFF ingredients taxonomy is mainly a list of elements that occur in ingredient lists. However the OFF Ingredients taxonomy is more than a simple thesaurus.
Example
Below is the ingredient class entry of soybean in the OFF ingredients taxonomy:
#description:en: SOYBEAN (Glycine max), or soya bean is a species of legume native to East Asia.
#comment:en: There is a bit of confusion whether "bean" should be part of the name, so "soy" or "soybeen"?.
<en:soya
en:soya bean, soy beans, soya beans, soybean
da:Sojabønne
de:Sojabohne, Sojabohnen, Sojakerne
es:granos de soja, habas de soja
et:Sojauba
fi:Soijapapu
fr:fève de soja, fèves de soja, soja entier, graine de soja, graines de soja, graines soja
Explanation
- Comment-line - the # describes a comment line and can be used to add other information to an ingredient. This information is not used by OFF.
- Description the # description:en: provides a definition of the ingredient in english. In this case the definition is taken from wikipedia.
- Description the # comment:en: provides a comment in english about the entry.
- Super-ingredient - the <en:soya line describes the super-ingredient ingredient of this (sub-ingredient) ingredient. The relationship between the super-ingredient an the sub-ingredient is an is-a-kind-of-relationship. This means that the sub-ingredient is a kind of super-ingredient. In this example the soybean ingredient is a kind soy ingredient. The sub-ingredient provides more details of the ingredient. In this case the super-ingredient is very broad (something with soy), wheras the sub-ingredient adds the detail of the bean. This line is optional, if no super-ingredient can be specified.
- Key - The en:soya bean line is the main name (key) of the ingredient. This doubles as name for the ingredient (in this case in english (en:)).
- Translations - The next lines provide translations of the main ingredient in other languages. One language per line. Each line starts with a language prefix. Thus de: means german.
- Synonyms - an ingredient in a language can have multiple synonyms. These synonyms appear as secondary entries on a language line. The first entry is however the main name in that language.
Multiple super ingredients
An ingredient can have multiple super ingredients. This can be used to express the relation between ingredients. These super ingredients reflect the description of an ingredient, such as grapefruit juice. The ingredient grapefruit juice has two components: the noun juice and the adjective grapefruit. The noun can be seen as the processing state (purée, juice, flour, etc.) and the adjective as subject of that processing. For an user both might be interesting in helping her: what kind of juice? how was the grapefruit processed?
As the same super-ingredient will also appear in combination with other ingredients, this might help the user understand what other fruits there are and how they are processed and what other juices from fruits exist.
At the moment (2018-10-05) the taxonomy is not consistently structured. It sometimes uses the noun and sometimes the adjective as structuring principle.
These super ingredients and ingredients can also be found in ingredient lists, where they are indicated by parentheses or colons. For example: Tomates (Dés, Jus, Purée), Yaourt (lait, crème, sucre, ferments lactiques), crème fraîche (dont lait) or Jus de : pomme, orange, passion, ananas, citron. These are four different approaches to super ingredients and define different relations between ingredients.
From a formal semantic viewpoint: juice is the superclass of grapefruit juice (grapefruit juice is-a-kind-of juice). In an "is-a-kind-of" relation one could replace the specific formulation grapefruit juice with the generic formulation juice, and still have a valid sentence.
In the taxonomy multiple super ingredients can be defined as:
<en:soya
<en:sauce
en:soy sauce
de:Sojasauce
Note that en:soya and en:sauce should exist as an ingredient in the taxonomy. (is there a consistency check done somewhere?)
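Whether OFF runs such a consistency check is not stated here. As a rough sketch of what one could look like, the Python function below reports every parent that has no entry of its own. The `missing_parents` name and the in-memory dictionary form of the taxonomy are assumptions made for illustration.

```python
def missing_parents(entries):
    """Return {child key: [parent keys that have no entry of their own]}."""
    known = set(entries)
    problems = {}
    for key, parents in entries.items():
        absent = [p for p in parents if p not in known]
        if absent:
            problems[key] = absent
    return problems


# Hypothetical in-memory form of the taxonomy: key -> list of parent keys.
taxonomy = {
    "en:soya": [],
    "en:soy sauce": ["en:soya", "en:sauce"],  # en:sauce is deliberately left out
}

print(missing_parents(taxonomy))  # {'en:soy sauce': ['en:sauce']}
```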
• Adjectives
The ingredient can be made more specific through adjectives. As these adjectives occur for many ingredients and are essentially the same, they have been separated into separate taxonomies. Once they are in such a taxonomy, they no longer have to be listed in the ingredients taxonomy. The adjectives can be categorised as:
- Processing adjectives describe what has been done with an ingredient, i.e. ground, cooked, etc. Either the processing itself is used as adjective or the result, for instance ground versus powdered. The identified processing adjectives can be found in the Processing Taxonomy.
- Labels - adjectives that follow the labels taxonomy, such as organic, etc.
• Origin list specification
Producers sometimes try to shorten ingredient lists by suppressing repetition. For instance the entry vegetable fats (palm, sunflower) should actually be read as two elements: palm vegetable fat and sunflower vegetable fat. In this case the parentheses act as a method to avoid repetition.
• Processed list specification
The example Tomates (Dés, Jus, Purée) is also used to shorten an ingredient list. In this case the processing used is in the parentheses.
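Both shorthand forms, vegetable fats (palm, sunflower) and Tomates (Dés, Jus, Purée), can be expanded in the same mechanical way. The Python sketch below only splits the text; the `expand_shorthand` name is hypothetical and this is not how OFF parses ingredient lists.

```python
import re

# "head (item, item, ...)" with nothing after the closing parenthesis.
_SHORTHAND = re.compile(r"^\s*(?P<head>[^()]+?)\s*\((?P<items>[^()]*)\)\s*$")


def expand_shorthand(text):
    """Split 'head (a, b)' into [(head, a), (head, b)]; plain entries pass through."""
    match = _SHORTHAND.match(text)
    if not match:
        return [(text.strip(), None)]
    head = match.group("head")
    items = [i.strip() for i in match.group("items").split(",") if i.strip()]
    return [(head, item) for item in items]


print(expand_shorthand("vegetable fats (palm, sunflower)"))
# [('vegetable fats', 'palm'), ('vegetable fats', 'sunflower')]
print(expand_shorthand("Tomates (Dés, Jus, Purée)"))
# [('Tomates', 'Dés'), ('Tomates', 'Jus'), ('Tomates', 'Purée')]
```

The split alone cannot tell whether the items in parentheses are origins, processing states or something else. That decision needs knowledge of the ingredients involved, for instance from the processing taxonomy.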
• Compound ingredients
It is possible to have an ingredient that in reality consists of multiple other ingredients. For instance fr:jus de soja can also be specified as en:water plus en:soybean. On ingredient lists this usually appears as jus de soja (eau, fèves de soja), i.e. the elements between parentheses define the real ingredients. The compound ingredients can also be seen as a (sub-)ingredient list. This is currently NOT encoded as super ingredient.
• Made of specification
The example crème fraîche (dont lait) designates that crème fraîche is produced from lait. This is usually used to indicate if an ingredient contains an allergenic substance. This is currently NOT encoded as super ingredient.
Exceptions / Reality
As the taxonomy is strongly tied to the url search function some exceptions have been implemented:
- en:xxxxx flavour - has only one parent: <en:flavour. The parent <en:xxxxx is not set up, as it would mix the real ingredient en:xxxxx with the artificial flavour ingredient.
Language prefix
The language prefixes are based on the approach taken by Wikipedia. In practice this implies that the language prefixes as used by Wikidata are used; this includes both the language and the associated script (if applicable). Sometimes the language prefix from the Wikipedia page is used. For two-letter prefixes the same prefix is used in Wikipedia and Wikidata. For prefixes of three or more letters this is no longer true (language acronyms).
Application
Inclusive search
The only application of the taxonomy at the moment is search by URL. You could enter the URL for strawberry to find all products that contain strawberry, or the URL for the Finnish entry (puutarhamansikka) to find all products that have a Finnish ingredient list with strawberry. This is an inclusive search: all ingredients that are more specific are included in the results as well.
What is supported as search is determined by parents that are defined in the taxonomy. So strawberry flavour is not found when searching for strawberry.
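The behaviour of an inclusive search can be illustrated by walking the parent relations downwards. The following is a minimal sketch in Python, not the OFF search implementation. The function names and the small `parents_of` fragment are made up for illustration, with keys that follow the examples on this page.

```python
from collections import defaultdict


def build_children(parents_of):
    """Invert a child -> parents mapping into parent -> children."""
    children = defaultdict(set)
    for child, parents in parents_of.items():
        for parent in parents:
            children[parent].add(child)
    return children


def inclusive_matches(key, parents_of):
    """Return key plus every ingredient that has key as a direct or indirect parent."""
    children = build_children(parents_of)
    found, stack = set(), [key]
    while stack:
        current = stack.pop()
        if current in found:
            continue
        found.add(current)
        stack.extend(children.get(current, ()))
    return found


# Illustrative fragment of the taxonomy: key -> list of parent keys.
parents_of = {
    "en:strawberry": [],
    "en:strawberry puree": ["en:strawberry"],
    "en:strawberry juice": ["en:strawberry"],
    "en:flavour": [],
    "en:strawberry flavour": ["en:flavour"],
}

print(sorted(inclusive_matches("en:strawberry", parents_of)))
# ['en:strawberry', 'en:strawberry juice', 'en:strawberry puree']
```

Because en:strawberry flavour only has en:flavour as parent, it is not reached from en:strawberry, which matches the exception described under Exceptions / Reality.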
Practice
Some experiences have been gathered in building the OFF taxonomy. These can be divided into guiding principles and steps.
Guiding principles
- Ingredient list guides - the entries as found on ingredient lists are the basis and guide the taxonomy. So no entries from other places. (an exception is translation bootstrapping)
- Primary language - the first language of an ingredient should be English, if the ingredient is available in that language. Otherwise use the language where the ingredient occurs the most (at the moment that will probably be French);
- Singular - ingredients should be written in singular, i.e. apricot and not apricots.
- Lower case
- No assumption - if in doubt about a translation or an assignment or whatever, keep the ingredient separate. Make no assumption about an ingredient. If it is specified vaguely, keep it vague;
- Or's - sometimes an ingredient is listed as X or Y. This will be entered as <en:X and <en:Y, so there are two superclasses.
Steps
To build the taxonomy from raw data, the following steps need to be taken:
- Gather - the first step is gathering the ingredients from the ingredient lists. This results in a list of raw ingredients. It is possible to create a list with ingredients that are not yet in the taxonomy. (This process needs to be described @stephane does this);
- Assign - the raw ingredients in a language should be mapped onto the existing taxonomy. For each raw ingredient one has to check whether it is already in the taxonomy. If it is not, one has to check whether it can be assigned to an existing ingredient, either as a synonym or as a new translation. Otherwise the raw ingredient is either totally new or more specific than an existing ingredient (the new raw ingredients);
- Merge - for each new raw ingredient it must be decided how it will fit in the existing taxonomy. Either it is totally new or it is a child of an existing ingredient;
- Key - if a new raw ingredient can be mapped to the taxonomy, it will probably be a new synonym of an existing entry. Ideally one should check which of the synonyms occurs the most and set that one as main ingredient name;
- Create - if the ingredient is totally new, a new entry can be created. It should be decided what the super ingredient is (if any). It can then be entered in that part of the taxonomy.
- Translate - try to find translations of the ingredient, for instance by searching for them in OFF. Looking up fr:polydextrose, for example, shows the languages available, and the translations offered there can be used;
- Define - add a definition from wikipedia and wikidata if available.
- OFF - add a link to the corresponding ingredients page. For instance for fr:polydextrose.
- Occurrences - note how often the ingredient occurs, in how many languages and at what date that was determined. This might require an advanced search (polydextrose). This might allow us to track if changes are necessary;
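The first check of the Assign step, whether a raw ingredient is already known under some name, can be sketched in Python as below. The function names and the dictionary layout of `entries` are hypothetical. Matching here is only on the exact lower-cased name, so deciding whether an unmatched ingredient is a synonym, a translation or a child still has to be done by hand.

```python
def build_name_index(entries):
    """Map every known name (main names and synonyms, per language) to its entry key."""
    index = {}
    for key, names_by_lang in entries.items():
        for lang, names in names_by_lang.items():
            for name in names:
                index[(lang, name.lower())] = key
    return index


def assign(raw_ingredients, lang, entries):
    """Split raw ingredients into those already known and those needing review."""
    index = build_name_index(entries)
    known, new = {}, []
    for raw in raw_ingredients:
        key = index.get((lang, raw.strip().lower()))
        if key is None:
            new.append(raw)
        else:
            known[raw] = key
    return known, new


# Illustrative taxonomy fragment: key -> {language: [main name, synonyms...]}.
entries = {"en:soya bean": {"en": ["soya bean", "soybean"], "de": ["Sojabohne"]}}

known, new = assign(["Soybean", "polydextrose"], "en", entries)
print(known)  # {'Soybean': 'en:soya bean'}
print(new)    # ['polydextrose']
```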
Watch out
It is possible to make mistakes in these steps.
- Wrong translation - the ingredient lists available on a product in different languages are not always translations of each other.
- Translation bootstrapping - in order to get up to speed with the translations, we use Wikipedia and Wikidata to find translations. This can result in wrong translations, as the term found there can be something different from what is found on the package. So be careful with Wikipedia disambiguation strings, Latin species names, etc. In the long run these should be superseded by what is found on the products.
- Contaminated ingredients - a lot of products contain text that is not an actual ingredient. This is caused either by text recognition, which picks up other elements as well, or by people adding text that is not an ingredient;
Setup
Adding a new language for ingredients to OFF starts with the basics: adding a lot of translations. Without this, your language will not be recognised, handled and analysed. There are two ways to add these translations.
Adding translations
The basic translation work can be done by anyone with a web browser. You can use the available web interface for this. For example to translate the Dutch ingredients, use this link: [1]. In this URL world stands for all products in the database and nl stands for those with Dutch as a main language. Then go through the provided list of ingredients. Be careful, as you might easily introduce wrong translations, which are not actually used on ingredient lists.
Editing translations
If you have experience with Github, you can edit the necessary files as well on your desktop. The ingredients are split over multiple files, which allows for specific handling:
- Ingredients taxonomy - this file contains all the normal ingredients;
- Additives - the file with all the additives;
- Allergens - the allergens required by legislation;
- Amino-acids - the amino acids, required for infant products;
- Minerals - the minerals required by legislation;
- Nucleotides - the nucleotides required by legislation;
- Other nutritional substances - also legislation;
- Vitamins - legislation issue;
You will notice that many ingredients are combined with an adjective either in front of or after the ingredient. As many adjectives occur often, it is not necessary to add all variants to the ingredient taxonomy. This keeps the taxonomy smaller and makes processing faster. Several different types of adjectives are identified:
- Processes - describe what has been done to an ingredient (roasted, rehydrated, etc.);
- Labels - describe the nature (?) of the ingredient, i.e. organic, fair trade, etc;
- Additive classes - each additive can be assigned to one or more additive class, which describes the role of the additive in the product;
Maintenance
Once the bulk of the ingredients have been added and translated, the maintenance phase begins. You will quickly see that the system identifies words and phrases as ingredients, when it should not do that. Time for finetuning and maintenance.
Detection
Finetuning and maintenance start with detection: finding ingredients that are not recognised or are recognised wrongly.
- Unknowns - start by creating a list of unknown ingredients with the url: https://nl.openfoodfacts.org/ingredients?status=unknown . This will show all products sold in the Netherlands (nl), with unknown ingredients.
You might find unknown ingredients (prefixed with a language code) in other languages if the main language is not the requested language.
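The same listing can be requested for other subdomains by changing the first part of the host name. The Python helper below is a small sketch of that; the `unknown_ingredients_url` name is hypothetical, and it assumes that other subdomains follow the same pattern as the nl example above.

```python
from urllib.parse import urlencode, urlunsplit


def unknown_ingredients_url(subdomain):
    """Build the 'unknown ingredients' listing URL for one subdomain."""
    query = urlencode({"status": "unknown"})
    return urlunsplit(("https", f"{subdomain}.openfoodfacts.org", "/ingredients", query, ""))


print(unknown_ingredients_url("nl"))
# https://nl.openfoodfacts.org/ingredients?status=unknown
```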
Finetuning
Finetuning allows you to tell the system where the ingredients start and end, which words should not be taken into account, etc. You might regularly return to this task, as the system analyses more and more ingredient lists.
- Adapt file ingredients.pm - this file contains various elements, which are used to analyse the ingredients of a product (in the order they appear in the file):
- Traces (%may_contain_regexps) - is the start of a phrase that introduces traces;
- Allergens (%contains_regexps) - appears after an ingredient, to indicate an allergen;
- Abbreviations (%abbreviations) - commonly used abbreviations in ingredient lists;
- Or (%of) - an optional choice between ingredients, eg colza or sunflower oil;
- And (%and) - two ingredients, not separated by a delimiter, eg herbs and spices;
- And of (%and_of) -
- And/or (%and_or) -
- The (%the) - the grammatical article of the language.
- Ignore after percent (%ignore_strings_after_percent) - a phrase that can be discarded when it appears after a %-sign;
- Ignore phrases (%ignore_regexps) - phrases that can be ignored (black list);
- Ingredients start (%phrases_before_ingredients_list) - this word is used as the beginning of the ingredient list, and helps OFF determine where the recognition should start.
- Ingredients start all caps (%phrases_before_ingredients_list_uppercase) - for example, in this ingredient list the list starts with INGREDIENTS;
- After ingredients (%phrases_after_ingredients_list) - an ingredient list can end in many different ways. These can be listed here;
- Wrong dashes (%prefixes_before_dash) - this allows words separated by a dash to be combined, e.g. demi - écrémé will be changed to demi-écrémé.
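To give an idea of how such per-language tables might be applied, the sketch below mimics two of them in Python. This is an illustration only: the real tables live in the Perl file, the two dictionaries here are hypothetical stand-ins that merely reuse the names, and their entries and the helper functions are made up.

```python
import re

# Hypothetical Python stand-ins for two of the per-language tables; the real
# tables live in the Perl file and may hold different entries.
phrases_before_ingredients_list = {"en": ["ingredients"], "fr": ["ingrédients"]}
prefixes_before_dash = {"fr": ["demi"]}


def strip_leading_phrase(text, lang):
    """Drop a known introductory phrase (and a following colon) from the start."""
    for phrase in phrases_before_ingredients_list.get(lang, []):
        pattern = r"^\s*" + re.escape(phrase) + r"\s*:?\s*"
        stripped, count = re.subn(pattern, "", text, count=1, flags=re.IGNORECASE)
        if count:
            return stripped
    return text


def fix_dashes(text, lang):
    """Join 'prefix - word' into 'prefix-word' for the listed prefixes."""
    for prefix in prefixes_before_dash.get(lang, []):
        pattern = r"\b(" + re.escape(prefix) + r")\s+-\s+"
        text = re.sub(pattern, r"\1-", text, flags=re.IGNORECASE)
    return text


text = "Ingrédients : lait demi - écrémé, sucre"
print(fix_dashes(strip_leading_phrase(text, "fr"), "fr"))
# lait demi-écrémé, sucre
```

Keeping the phrases in a table keyed by language means that supporting a new language is a matter of adding entries rather than writing new code, which is the point of the list above.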
You might require special handling, which in turn requires specific code. Please contact the OFF'ers on Slack for that (#ingredients).
Adding
- Finding unknown ingredients
- Adding processes
- Adding labels
- Adding black listed ingredients
- Language transposes
Related pages
There is more documentation related to ingredients and the corresponding taxonomy: