Ingredients ontology

Latest revision as of 08:29, 23 October 2023

Introduction

The ingredients taxonomy is a collection of all the names of ingredients found on products. This taxonomy is used for analysing the ingredients of a product and deriving qualities like Nutri-Score, NOVA or Eco-Score. It can also be used for translation purposes.

For an introduction to taxonomies in general, have a look at Taxonomies.

If you would like to help in maintaining the ingredients taxonomy (and others), but do not want to read all the details, have a look at Taxonomy Maintenance.

This document discusses the details of the taxonomy and possible evolutions.

Why?

Why do we need an ingredients ontology? The ontology describes how ingredients are derived from each other and how ingredients can be combined into new ingredients. The basic thesaurus can help to:

Normalise ingredients - Producers take a lot of freedom in describing the ingredients they use. They use different words (synonyms) to designate the same ingredient. An ontology helps to standardise the ingredients;
Exclusive search - the taxonomy can be used to support search for ingredients in any language and multiple synonyms;
Translation - as the taxonomy contains translations of each ingredient, it can be used as means to translate ingredient lists;
Ingredients analysis - the taxonomy can be extended with properties to indicate if ingredients are suitable for vegans, vegetarians etc. and to estimate how processed the food product is (e.g. NOVA groups)
Ingredient language inconsistencies - the ingredient lists of a single product sometimes differ between languages, for instance because ingredients are left out or simplified. The thesaurus might help to reveal these.

If the thesaurus is extended by relations between ingredients, other benefits arise, depending on what is defined in the taxonomy. The basic relation is the isa-relation or is-a-kind-of relation. The isa-relation defines two ingredients that are related. For instance the ingredient strawberry can also be found as strawberry puree and strawberry juice. Strawberry is more generic than strawberry puree, which is more specific. The rule here is: would it make sense to replace the children by the parent in the ingredients list? For instance strawberry puree and strawberry juice are still strawberry, but under a different shape. So if you replace strawberry puree with strawberry in the ingredient list, it is still a valid ingredient list. Usually strawberry is the parent and strawberry puree a child.

This isa-relation makes the following functionalities possible:

Inclusive source search - The inclusive search means that a search for the parent strawberry will also show the children strawberry puree, etc. The source search means the origin of the ingredient, i.e. strawberry, apple, cinnamon, soy, etc. (I want another word for source).
Inclusive condition search - in a similar way we could search for the condition of a source ingredient, for instance pureed, juiced, dried, reconditioned, etc. So if we would like to search for all pureed fruits, the relation between fruit puree and strawberry puree should be defined in the taxonomy.

One could also define other relations in order to support:

Hidden ingredients - an ingredient might contain hidden ingredients, the ontology might reveal these. For example butter contains butterfat.
Combined ingredients - an ingredient might appear as a single ingredient. In reality however
Processed ingredients - often an ingredient is derived from another ingredient through some process. We can make explicit what these processes are. For example, clarified butter is created from butter by separating the milk solids and water from the butterfat.
Ingredient incompleteness - often an ingredient is incompletely defined in an ingredient list. For instance if an ingredient list specifies milk, it should be defined which mammal the milk comes from, for instance cow's milk.

Taxonomy

The current OFF ingredients taxonomy (github) can be seen as a thesaurus (although more information is slowly being added).

In the context of information retrieval, a thesaurus (plural: "thesauri") is a form of controlled vocabulary that seeks to dictate semantic manifestations of metadata in the indexing of content objects. A thesaurus serves to minimise semantic ambiguity by ensuring uniformity and consistency in the storage and retrieval of the manifestations of content objects. ANSI/NISO Z39.19-2005 defines a content object as "any item that is to be described for inclusion in an information retrieval system, website, or other source of information". The thesaurus aids the assignment of preferred terms to convey semantic metadata associated with the content object. (wikipedia)

The purpose of the thesaurus is to have a list of ingredients, that occur in ingredient lists, that are unique and are well defined. The OFF ingredients taxonomy is mainly a list of elements that occur in ingredient lists. However the OFF Ingredients taxonomy is more than a simple thesaurus.

Example

Below is the ingredient class entry of soybean in the OFF ingredients taxonomy:
#description:en: SOYBEAN (Glycine max), or soya bean is a species of legume native to East Asia.
#comment:en: There is a bit of confusion whether "bean" should be part of the name, so "soy" or "soybeen"?.
<en:soya
en:soya bean, soy beans, soya beans, soybean
da:Sojabønne
de:Sojabohne, Sojabohnen, Sojakerne
es:granos de soja, habas de soja
et:Sojauba
fi:Soijapapu
fr:fève de soja, fèves de soja, soja entier, graine de soja, graines de soja, graines soja

Explanation

Comment-line - the # describes a comment line and can be used to add other information to an ingredient. This information is not used by OFF.
Description - the #description:en: line provides a definition of the ingredient in English. In this case the definition is taken from Wikipedia.
Comment - the #comment:en: line provides a comment in English about the entry.
Super-ingredient - the <en:soya line describes the super-ingredient of this (sub-)ingredient. The relationship between the super-ingredient and the sub-ingredient is an is-a-kind-of relationship. This means that the sub-ingredient is a kind of super-ingredient. In this example the soybean ingredient is a kind of soy ingredient. The sub-ingredient provides more details of the ingredient. In this case the super-ingredient is very broad (something with soy), whereas the sub-ingredient adds the detail of the bean. This line is optional and is left out if no super-ingredient can be specified.
Key - The en:soya bean line is the main name (key) of the ingredient. This doubles as the name of the ingredient (in this case in English (en:)).
Translations - The next lines provide translations of the main ingredient in other languages. One language per line. Each line starts with a language prefix. Thus de: means German.
Synonyms - an ingredient in a language can have multiple synonyms. These synonyms appear as secondary entries on a language line. The first entry is however the main name in that language.
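
The entry format explained above is regular enough to be read line by line. The Python sketch below is only an illustration and not the parser that OFF uses: the names Entry and parse_entry are hypothetical, and it assumes that every # line has the form "#name:lang: value".

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One taxonomy entry: parents, names per language and '#' properties."""
    parents: list[str] = field(default_factory=list)
    names: dict[str, list[str]] = field(default_factory=dict)
    properties: dict[str, str] = field(default_factory=dict)

    @property
    def key(self) -> str:
        # The first language line supplies the key, e.g. "en:soya bean".
        lang, synonyms = next(iter(self.names.items()))
        return f"{lang}:{synonyms[0]}"


def parse_entry(text: str) -> Entry:
    """Parse the lines of a single entry (hypothetical helper)."""
    entry = Entry()
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # "#description:en: some text" -> properties["description:en"]
            name, _, value = line[1:].strip().partition(": ")
            entry.properties[name.rstrip(":")] = value.strip()
        elif line.startswith("<"):
            # "<en:soya" names a super-ingredient (parent).
            entry.parents.append(line[1:].strip())
        else:
            # "de:Sojabohne, Sojabohnen" -> main name followed by synonyms.
            lang, _, rest = line.partition(":")
            entry.names[lang.strip()] = [s.strip() for s in rest.split(",") if s.strip()]
    return entry


SAMPLE = """\
#description:en: SOYBEAN (Glycine max), or soya bean is a species of legume native to East Asia.
<en:soya
en:soya bean, soy beans, soya beans, soybean
de:Sojabohne, Sojabohnen, Sojakerne
"""

entry = parse_entry(SAMPLE)
print(entry.key)          # en:soya bean
print(entry.parents)      # ['en:soya']
print(entry.names["de"])  # ['Sojabohne', 'Sojabohnen', 'Sojakerne']
```

The sketch keeps the languages in file order, so the key is simply the first name on the first language line.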

Multiple super ingredients

An ingredient can have multiple super ingredients. This can be used to express the relation between ingredients. These super ingredients reflect the description of an ingredient, such as grapefruit juice. The ingredient grapefruit juice has two components: the noun juice and the adjective grapefruit. The noun can be seen as the processing state (purée, juice, flour, etc.) and the adjective as subject of that processing. For a user both might be of interest, as they answer two questions: what kind of juice is it, and how was the grapefruit processed?

As the same super-ingredient will also appear in combination with other ingredients, this might help the user understand what other fruits there are and how they are processed and what other juices from fruits exist.

At the moment (2018-10-05) the taxonomy is not consistently structured. It sometimes uses the noun and sometimes the adjective as structuring principle.

These super ingredients and ingredients can also be found in ingredient lists, but then they are indicated by parentheses or colons. For example: Tomates (Dés, Jus, Purée), Yaourt (lait, crème, sucre, ferments lactiques), crème fraîche (dont lait) or Jus de : pomme, orange, passion, ananas, citron. These are four different approaches to super ingredients and define different relations between ingredients.

From a formal semantic viewpoint: juice is the superclass of grapefruit juice (grapefruit juice is-a-kind-of juice). In an is-a-kind-of relation one could replace the specific formulation grapefruit juice with the generic formulation juice, and still have a valid sentence.

In the taxonomy multiple super ingredients can be defined as:
<en:soya
<en:sauce
en:soy sauce
de:Sojasauce

Note that en:soya and en:sauce should exist as an ingredient in the taxonomy. (is there a consistency check done somewhere?)
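
Whether such a consistency check already exists is not stated here. A minimal sketch of what it could look like is given below; the function name missing_parents and the dictionary layout (entry key mapped to its list of parents) are assumptions made for the example.

```python
def missing_parents(entries: dict[str, list[str]]) -> dict[str, list[str]]:
    """Return, per entry key, the parents that are not entry keys themselves."""
    known = set(entries)
    report: dict[str, list[str]] = {}
    for key, parents in entries.items():
        unknown = [parent for parent in parents if parent not in known]
        if unknown:
            report[key] = unknown
    return report


# Hypothetical excerpt in which en:sauce has not been defined yet.
taxonomy = {
    "en:soya": [],
    "en:soy sauce": ["en:soya", "en:sauce"],
}
print(missing_parents(taxonomy))  # {'en:soy sauce': ['en:sauce']}
```

An empty result means that every super ingredient that is referenced also exists as an entry.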

• Adjectives

The ingredient can be made more specific through adjectives. As these adjectives occur for many ingredients and are essentially the same, they have been separated into separate taxonomies. Once they are in such a taxonomy, they no longer have to be listed in the ingredients taxonomy. The adjectives can be categorised as:

Ciqual food code is a unique code in the format "ciqual_food_code:en:nnnn";
Ciqual food name is the corresponding name in English in the format "ciqual_food_name:en:aaaa" and in French in the format "ciqual_food_name:fr:aaaa";
Description is a more formal description (definition) of the ingredient, which can be used on the website for instance. This can be in any language. The definition can be taken from the first line of the wikipedia page. It should comprise the adjectives that define the ingredient.
Processing adjectives describe what has been done with an ingredient, i.e. ground, cooked, etc. Either the processing itself is used as adjective or the result, for instance ground versus powdered. The identified processing adjectives can be found in the Processing Taxonomy.
Labels adjectives that follow the labels taxonomy, such as organic, etc.
Wikidata indicates the entry on wikidata.org. The format should be "wikidata:en:Qnnnnn".
Wikipedia indicates the link to an entry on wikipedia in English (if available). The format should be "wikipedia:en:https://wikipedia.org/wiki/aaaaaa".

• Origin list specification

Producers sometimes try to shorten the length of ingredients lists by suppressing repetition. For instance the entry vegetable fats (palm, sunflower) should actually be read as two elements: palm vegetable fat and sunflower vegetable fat. In this case the parentheses act as method to avoid repetition.
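
The expansion of such a shortened entry can be sketched in a few lines of Python. This is an illustration under assumptions, not existing OFF code: expand_origin_list is a hypothetical name, the English word order (origin before the ingredient) is assumed, and the plural of the head (fats) is kept as written rather than turned into the singular.

```python
import re

# A head followed by one parenthesised, comma-separated list.
_ORIGIN_LIST = re.compile(r"^\s*(?P<head>[^()]+?)\s*\((?P<items>[^()]*)\)\s*$")


def expand_origin_list(text: str) -> list[str]:
    """Expand 'head (a, b)' into ['a head', 'b head']."""
    match = _ORIGIN_LIST.match(text)
    if match is None:
        # No parentheses: nothing to expand.
        return [text.strip()]
    head = match.group("head")
    items = [item.strip() for item in match.group("items").split(",") if item.strip()]
    # Empty parentheses fall back to the head alone.
    return [f"{item} {head}" for item in items] or [head]


print(expand_origin_list("vegetable fats (palm, sunflower)"))
# ['palm vegetable fats', 'sunflower vegetable fats']
```

The same pattern would also match a compound ingredient such as jus de soja (eau, fèves de soja), so a real implementation needs a way to tell the two cases apart.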

• Processed list specification

The example Tomates (Dés, Jus, Purée) is also used to shorten an ingredient list. In this case the processing used is in the parentheses.

• Compound ingredients

It is possible to have an ingredient that in reality consists of multiple other ingredients. For instance fr:jus de soja can also be specified as en:water plus en:soybean. On ingredient lists this usually appears as jus de soja (eau, fèves de soja), i.e. the elements between parentheses define the real ingredients. The compound ingredients can also be seen as a (sub-)ingredient list. This is currently NOT encoded as super ingredient.

• Made of specification

The example crème fraîche (dont lait) designates that crème fraîche is produced from lait. This is usually used to indicate if an ingredient contains an allergenic substance. This is currently NOT encoded as super ingredient.

Exceptions / Reality

As the taxonomy is strongly tied to the url search function some exceptions have been implemented:

en:xxxxx flavour - has only one parent: <en:flavour. The parent <en:xxxxx is not set up, as it would mix the real ingredient en:xxxxx with artificial flavour ingredient.

Language prefix

The language prefixes are based on the approach taken by Wikipedia. In practice this implies that the language prefixes as used by wikidata are used; this includes both the language and the associated script (if applicable). Sometimes the language prefix from the wikipedia-page is used. For two-letter prefixes the same prefix is used in wikipedia and wikidata. For three (and more) letter prefixes this is no longer true (language acronyms).

Application

Inclusive search

The only application for the taxonomy at the moment is search by url. You could enter the url strawberry to find all products that contain strawberry. Or strawberry to find all products that have a Finnish ingredient list with strawberry (puutarhamansikka). This is an inclusive search, which implies that all ingredients that are more specific are included as well in the results.

What is supported as search is determined by parents that are defined in the taxonomy. So strawberry flavour is not found when searching for strawberry.
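
The behaviour described here can be sketched as follows. This is not the OFF search code; the function names and the small parents and products dictionaries are made up for the example. It only shows why strawberry puree is found and strawberry flavour is not.

```python
def descendants(parents: dict[str, list[str]], root: str) -> set[str]:
    """Return root together with every ingredient that has root as an ancestor."""
    children: dict[str, list[str]] = {}
    for child, supers in parents.items():
        for parent in supers:
            children.setdefault(parent, []).append(child)
    found, stack = {root}, [root]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def inclusive_search(products: dict[str, list[str]],
                     parents: dict[str, list[str]],
                     ingredient: str) -> list[str]:
    """Names of the products that contain the ingredient or one of its children."""
    wanted = descendants(parents, ingredient)
    return [name for name, ingredients in products.items() if wanted & set(ingredients)]


# Hypothetical taxonomy excerpt: the flavour has en:flavour as its only parent.
parents = {
    "en:strawberry": [],
    "en:strawberry puree": ["en:strawberry"],
    "en:strawberry juice": ["en:strawberry"],
    "en:flavour": [],
    "en:strawberry flavour": ["en:flavour"],
}
products = {
    "jam": ["en:strawberry puree", "en:sugar"],
    "candy": ["en:strawberry flavour", "en:sugar"],
}
print(inclusive_search(products, parents, "en:strawberry"))  # ['jam']
```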

Practice

Some experiences have been gathered in building the OFF taxonomy. These can be divided into guiding principles and steps.

Guiding principles

Ingredient list guides - the entries as found on ingredient lists are the basis and guide the taxonomy. So no entries from other places. (an exception is translation bootstrapping)
Primary language - the first language of an ingredient should be English, if it is available in that language. Otherwise use the language where the ingredient occurs the most (at the moment that will probably be French);
Singular - ingredients should be written in singular, i.e. apricot and not apricots.
Lower case
No assumption - if in doubt about a translation or an assignment or whatever, keep the ingredient separate. Make no assumption about an ingredient. If it is specified vaguely, keep it vague;
Or's - sometimes an ingredient is listed as X or Y. This will be entered as <en:X and <en:Y, so there are two superclasses.

Steps

To build the taxonomy from raw data, the following steps need to be taken:

Gather - the first step is gathering the ingredients from the ingredient lists. This results in a list of raw ingredients. It is possible to create a list with ingredients that are not yet in the taxonomy. (This process needs to be described @stephane does this);
Assign - the raw ingredients in a language should be mapped onto the existing taxonomy. For each raw ingredient one has to check whether it is already in the taxonomy. If the raw ingredient is not yet in the taxonomy one has to check whether it can be assigned to an existing ingredient, either as synonym, or as new translation. The raw ingredient might be totally new or more specific than an existing ingredient (the new raw ingredients);
Merge - for each new raw ingredient it must be decided how it will fit in the existing taxonomy. Either it is totally new or it is a child of an existing ingredient;
Key - if a new raw ingredient can be mapped to the taxonomy, it will probably be a new synonym of an existing entry. Ideally one should check which of the synonyms occurs the most and set that one as main ingredient name;
Create - if the ingredient is totally new, a new entry can be created. It should be decided what the super ingredient is (if any). It can then be entered in that part of the taxonomy.
Translate - try to find the translations of the ingredient, for example by searching for translations in OFF. For instance, searching for fr:polydextrose shows the languages available; try to use the translations offered;
Define - add a definition from wikipedia and wikidata if available.
OFF - add a link to the corresponding ingredients page. For instance for fr:polydextrose.
Occurrences - note how often the ingredient occurs, in how many languages and at what date that was determined. This might require an advanced search (polydextrose). This might allow us to track if changes are necessary;
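
The Assign step above boils down to a lookup of each raw ingredient in the names the taxonomy already knows. The Python sketch below illustrates this under assumptions: build_index and assign are hypothetical names, the taxonomy is represented as a plain dictionary, and matching is done on the lower-cased text only.

```python
def build_index(entries: dict[str, dict[str, list[str]]]) -> dict[tuple[str, str], str]:
    """Map (language, lower-cased name or synonym) to the entry key."""
    index: dict[tuple[str, str], str] = {}
    for key, names in entries.items():
        for lang, synonyms in names.items():
            for synonym in synonyms:
                index[(lang, synonym.casefold())] = key
    return index


def assign(raw: list[tuple[str, str]], index: dict[tuple[str, str], str]):
    """Split raw (language, text) pairs into already known and new ones."""
    known: dict[tuple[str, str], str] = {}
    new: list[tuple[str, str]] = []
    for lang, text in raw:
        key = index.get((lang, text.strip().casefold()))
        if key is None:
            new.append((lang, text))
        else:
            known[(lang, text)] = key
    return known, new


# Hypothetical taxonomy excerpt and raw ingredients.
entries = {"en:soya bean": {"en": ["soya bean", "soybean"], "de": ["Sojabohne"]}}
raw = [("de", "sojabohne"), ("en", "Soybean "), ("fr", "polydextrose")]

known, new = assign(raw, build_index(entries))
print(known)  # both soybean spellings map to 'en:soya bean'
print(new)    # [('fr', 'polydextrose')]
```

The raw ingredients that end up in the new list are the ones for which the Merge, Key and Create steps still have to be done by hand.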

Watch out

It is possible to make mistakes in these steps.

Wrong translation - the ingredient lists available on a product in different languages do not always offer exact translations.
Translation bootstrapping - in order to get up to speed with the translations, we use Wikipedia and Wikidata to find translations. This can result in wrong translations, as the term can mean something different from what is found on the package. So be careful with wikipedia disambiguation strings, latin species names, etc. In the long run these should be superseded by what is found on the products.
Contaminated ingredients - a lot of products contain text that is not an actual ingredient. Either this is caused by text recognition, which recognizes other elements as well, or people add text that is not an ingredient;

Theory

What theory can be used to base an food ingredients taxonomy on? Is there already a food ontology somewhere?

OWL

I get the impression that OWL is the most accepted markup language to exchange semantic networks. I will use the Manchester markup to describe the axioms.

Tools

Discovering Protégé from Stanford at the moment. There are standalone versions and web-versions of the app. Maybe the web-version can be used to maintain the ontology by OFF-users. Unfortunately the web-version does not support all the detailed modelling options.

Existing ontologies

It would be a pity if we have to develop ontologies for all the

Languages

Several language ontologies exist:

Lingvoj seems to be a languages ontology, which is available in rdf. It is not clear how to download the individual language files.
The EU has an XML and SKOS download (Language NAL).
Lexvo (RDF-file can not be loaded by Protégé)

These solutions seem to be overkill for our needs.

Food

Can we use LanguaL in some way? Maybe for inspiration.

DBPedia

This should provide links to wikipedia, short summaries and be multilingual.

Geography

This is needed to get some structure for locations (useful for IGP).

Labels

Are there taxonomies/ontologies for labels?

Issues

It is not easy to convert all aspects of OFF to OWL and Protégé. Some issues that have been encountered:

Translations
Inheritance

Ontology building

This section describes how the ontology is created starting from the ingredients found on the ingredient lists of products.

Individuals

The purpose of the ontology is to have a set of axioms between canonical ingredients, which is logically consistent. If a producer puts an ingredient in an ingredient list (the raw ingredient), it will first be normalised to a single canonical ingredient name. This canonical name will define the ingredient class. These canonical ingredient names are the individuals in OWL/Protégé terminology.

Raw ingredients

The creation of the ontology starts with the ingredients as found on the ingredient lists. A raw ingredient is defined as the text between two ingredient dividers (where are these defined?). This might include any percentage, organic markup, allergen markup, origin and sub-ingredients. In combination with the language used, it gives a first indication of the ingredient class.
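
As the dividers are not defined here, the sketch below simply assumes that commas and semicolons act as dividers and that dividers inside parentheses or brackets belong to the sub-ingredients. The function name split_raw_ingredients is hypothetical. Decimal commas (2,5%) would be split incorrectly by this simple approach.

```python
def split_raw_ingredients(text: str, dividers: str = ",;") -> list[str]:
    """Split an ingredient list at dividers that are not inside parentheses."""
    parts: list[str] = []
    current: list[str] = []
    depth = 0  # nesting level of parentheses and brackets
    for char in text:
        if char in "([":
            depth += 1
        elif char in ")]":
            depth = max(depth - 1, 0)
        if char in dividers and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(char)
    parts.append("".join(current).strip())
    # Drop empty pieces caused by doubled or trailing dividers.
    return [part for part in parts if part]


print(split_raw_ingredients("sugar, jus de soja (eau, fèves de soja), salt 2%; butter"))
# ['sugar', 'jus de soja (eau, fèves de soja)', 'salt 2%', 'butter']
```

Percentages and sub-ingredients stay part of the raw ingredient, as in the definition above.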

Normalized ingredients

The raw ingredients must be transformed to normalized ingredients. A normalized ingredient is a single string in a western script (this assumes that the western script can be read by anyone in the world). A normalized ingredient is preceded by the language code of the ingredient. Usually this is English if the translation is available. (It is possible that the source language is needed for a better comprehension when classifying ingredients).

Normalizing ingredients implies that any variant of an ingredient in any language is mapped onto a single name. The normalizing procedure is mainly a pattern recognition procedure.

For Protégé this implies that any normalized ingredient individual must have a data property, which defines the data corresponding to the individual. We encode this as:

Individual: en:Butter
    Facts: hasName "en:butter"^^xsd:string

Probably it is possible to bypass this normalisation step, but that makes the encoding in OWL much more extended.

Normalisation procedure

keep and show - try to keep the original entry as much as possible (remove typos, etc.) and show the user what the canonical variant is.

plural / singular - should all ingredients be defined in singular or plural? It probably depends on what is meant (or left out). If a raw ingredient says minerals or salts, it does not say which mineral or salt. There can be more than one mineral or salt. So in the normalisation plurals should be kept.

split/combine - it is tempting to combine ingredient names that point to the same ingredient. Whether one should do so depends on the practical usage of the names. If both names are used equally often, it is better to keep them separate. If one name is obscure, let it be a synonym of the more prevalent name. For instance butterfat and ghee are used equally often and should be kept separate, although they are the same ingredient.

ingredient/product - ingredient lists can contain products, or better compound ingredients. A compound ingredient consists of multiple sub-ingredients. This can be listed as: compound ingredient (subingredient1, subingredient2, ...) or compound ingredient: subingredient1, subingredient2, ...;. This structure should be kept intact, so a better normalisation can be done.

translations - for each canonical ingredient-name, translations in other languages might exist. The best is to deduce the translations from the actual ingredient lists. However be careful, the producer might have taken some liberties with these translations. If you think the translation is faulty do not add it. Create two different canonical ingredients and have them point to each other.

explanatory translations - it is tempting to add a description if the translation does not exist. Such an explanatory translation can be part of a class annotation.

Issues

Things we do not understand or are a design issue.

Individual or class - should the canonical ingredient be seen as an individual or class? The normalisation can be seen as a method to classify all ingredient names. Each ingredient name, i.e. en:butter or fr:beurre, is an individual, which is now mapped onto a single individual en:butter. It all depends on how an individual is mapped onto a class. I assume now that the inference engine does the parsing of the canonical ingredient name. For each canonical name we could define:

Individual: en:Butter 
  Types: Butter
Individual: en:Butter
  Types: English

Ingredient Classes

The classes define the Ingredients, Languages and other concepts used in the ontology.

Named Ingredient Classes

The ingredient classes are the sets that represent the individuals. Each normalised individual will be mapped to a single class. These are the atomic classes (the named classes)

• URI

Each class has a unique URI. We use the canonical ingredient name for this URI plus a namespace prefix, for example: OFF_Named_Butter or OFF_Named_Beurre_Doux (the adjectives are placed behind).

• Necessary conditions

Each ingredient class has a necessary and sufficient condition (equivalent class expression) consisting of a single data property (one can also specify multiple data properties here if one does not want to do the normalisation):

Class: Butter
    EquivalentTo: hasName some xsd:string[pattern "en:butter"]

• Annotations

A class can further be described by annotations, such as:

label - a user friendly name of the class, probably the canonical name of the ingredient with a corresponding language code. It is possible to add names for each language. These labels can then serve as translation strings.
wikipedia - a link to the corresponding ingredient on Wikipedia. Preferably this link is in English, but links in other languages can be added as well.
wikidata - the link to the English wikidata entry.
openFoodFacts ingredients - a link to the corresponding world ingredient page on OFF.
comment - a comment in English can be added to explain how the ingredient is defined.

• Axioms

Each named ingredient should have one or more associated properties. These properties define the ingredient uniquely, in addition to just the name. This definition can be used to show how it differs from other ingredients and is the basis for the Defined Ingredients. This definition must be based on information gathered elsewhere, such as wikipedia, wikidata or legislation.

It should be attempted to define a Named Ingredient with as few properties as possible, in order to reduce the complexity. The definition can be in the form of:

hasPercentage axioms - these axioms define the amount of a single component of an ingredient. If needed multiple hasPercentage axioms can be combined.

Class: Butter
    SubClassOf: hasPercentageOfButterfat some xsd:decimal[>= 80]

The official definition for butter is that it must have at least 80% of butterfat, so this property is enough to define the butters. It might be necessary to add other type of axioms.

Defined Ingredient Classes

It is possible to group Named Ingredient classes into other classes, in order to define a set of similar ingredients. It is possible to do this by combining Named Ingredient classes by hand. For instance:

Class: OFF-Butterfat
    EquivalentTo: { Butterfat, Ghee, Concentrated Butter, Clarified Butter }

This approach seems however to be a bit arbitrary. Why are these ingredients combined? What is their commonality? It would be nice to define a set of axioms that can combine these automatically.

The official definition of butterfat (> 99% butterfat) helps to define the class of butterfats:

Class: OFF-Butterfat
    EquivalentTo: hasPercentageOfButterfat some xsd:decimal[>= 99]

Note that the class name is OFF-Butterfat; this helps to distinguish it from the Named Ingredient Butterfat.

Object properties

These object properties define what we know about the relation between the classes. These object properties are defined independent of the classes (that is a base idea in OWL). These properties can help the user see how ingredients are related. These relations can be divided into several categories:

isDerived to show how one gets from one ingredient to the next;
hasAdded to show if another ingredient has been added (not sure what the difference with isDerived is);
isRemoved to show if another ingredient has been removed;
contains - to indicate the relation with other classes:

Class: Butter
    SubClassOf: contains only Butterfat

It is possible to come up with many different object properties. The question is when they are useful and when not.

• Domain / Range

As these object properties are defined outside the classes, it is not clear upfront to which classes they apply. The use of domains and ranges helps with this. The domain specifies the subject and the range the object, for instance: Butter isDerivedFrom Butterfat. As we will have multiple isDerivedFrom object properties, it is better to name these in full, like:

Class: Butter
    ButterIsDerivedFromButterfat: Butterfat

I gave up on the domain/range approach, as it is a lot of work. Instead I add the object properties to the class directly. No inference is needed then.

Class: Butter
    SubClassOf: ( isDerivedFrom only Butterfat ) or ( isDerivedFrom only Cream )

In this complex expression one could specifically add the origin for each Named Ingredient Class.

• Inverse relationships

Not sure if these need to be defined. Maybe if the user should be able to go in both directions.

General axioms

Not sure yet how to use these.

Reasoner

The tool Protégé has a reasoner, which checks the correctness of the ontology. It should be run regularly in order to check for errors.

Language Classes

For each language a separate class can be defined. The name of the class, English Language for instance, is in English and makes clear what the class is about. The annotation data for the language can be taken from Lingvoj by hand. Lingvoj has each language defined as a separate individual. We want to assign each ingredient individual to a specific language class. Each language class is a subclass of the superclass Language.

Each language class has a data property, which is a necessary and sufficient condition that determines its membership. This data property describes how the language is encoded in an ingredient individual. This encoding consists of the language code that corresponds to the language. For example for an ingredient individual in English:

Class: English
    EquivalentTo: hasName some xsd:string[pattern "en:.*"]

This uses a regex-string for decoding.
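
Outside Protégé the same membership test can be tried out with a few lines of Python. This is only a sketch: the table of language classes and the function name are assumptions, and re.fullmatch is used because an XSD pattern has to match the whole string.

```python
import re

# Hypothetical language classes with the pattern that decides membership.
LANGUAGE_PATTERNS = {
    "English": r"en:.*",
    "French": r"fr:.*",
}


def language_classes(name: str) -> list[str]:
    """Return the language classes whose pattern matches the whole name."""
    return [cls for cls, pattern in LANGUAGE_PATTERNS.items() if re.fullmatch(pattern, name)]


print(language_classes("en:butter"))  # ['English']
print(language_classes("fr:beurre"))  # ['French']
print(language_classes("butter"))     # []
```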

Origin classes

These classes define the original source of an ingredient, before any processing is done. For instance milk comes from female mammals, cow milk comes from female cows, salt comes from the mineral halite, etc.

The food thesaurus created by LanguaL can be used as a base.

Label classes

These classes describe labels that have been assigned to Ingredients.

Geography Classes

These classes describe the geographic origin of classes. These can be added as necessity requires. A hierarchy based on continent and country or region can be added.

Example

The first attempts at an ontology can be found on GitHub.

Maybe I can make a drawing of a part of the ontology.

Ontology Usage

The goal is to obtain an ontology that can be used by OFF to analyse the ingredients in an ingredient list. One could envisage the following steps:

The ingredient instance/entry/individual is entered into the inference engine
The engine infers an Ingredient Class and its subclasses
The results can be presented to the user and will explain what the class is, how it is related to other Ingredient Classes, how the Ingredient Class is created from other classes and it will show what the ingredient entry does not tell.
The user can select the language in which he wants to see the results.

Categories Taxonomy

OFF maintains a categories taxonomy to categorise products. This categories taxonomy is related to the ingredient ontology, but subtly different. First a product is something that is on sale and has an identifier (barcode). A product will never be part of an ingredients list. You will never see a barcode in the ingredients.

The relationship between product categories and ingredients can be described as: A product with a product name and barcode from a brand belongs to a category and has one or more ingredients.

For example the product Beurre Gastronomique Doux of the brand Milbona with barcode 20139315 has one ingredient: Beurre pasteurisé and belongs to the Sweet cream butters category.

Diet support

The ingredients taxonomy can be used by applications (web and mobile) to filter certain ingredients or ingredient groups for dietary purposes. The existing structure offers already many hooks for simple diets. For more complex ones additional hooks can be added.

Allergens

Currently allergens are supported in a separate taxonomy. This allergens taxonomy is however based on ingredients, so there is a large overlap in work. The list below explains, for each allergen, how it can be supported by the ingredients taxonomy.

Gluten:
Crustaceans: en:crustacean and children.
Egg: en:egg and its children.
Fish: en:fish and children.
Peanuts: en:peanut and en:peanut oil
Soybeans: en:soya, en:soya oil, en:soy flour, en:soy protein, en:tofu, en:soy preparation, en:soy sauce, en:soya lecithin; maybe: E476
Milk: en:dairy, en:lactose, en:milk minerals, en:milk proteins, en:milk chocolate, en:milk filling, en:ice cream, en:bechamel sauce
Nuts: en:nut, en:peanut, en:crocant, en:nougat, en:pistachio seed oil, en:pine nuts, en:walnut oil, en:hazelnut oil, en:pink peppercorn
Celery: en:celery and en:celeriac and children.
Mustard: en:mustard covers all synonyms. Should also some brassica be included? wasabi?
Sesame seeds: en:sesame, en:sesame oil and children.
Sulphur dioxide:
Lupin: en:lupin and its child.
Molluscs: en:molluscs and its children.

Vegetarian

Vegetarianism has many subdivisions as is seen below. The table lists which ingredient (groups) need to be excluded for each diet.

Name	Livestock	Poultry	Game	Seafood	Dairy	Eggs	Honey	Root vegetables
Fruitarianism	en:animal	en:poultry	en:game animal	en:fish, en:shellfish	en:dairy, en:lactose, en:milk minerals, en:milk proteins, en:milk chocolate, en:milk filling, en:ice cream, en:bechamel sauce	en:egg	en:honey, en:royal jelly, en:pollen, en:beeswax, en:marzipan, en:nougat, en:invertase	en:root vegetable, en:ginger, en:betanin, en:turmeric
Jain vegetarianism	en:animal	en:poultry	en:game animal	en:fish, en:shellfish	-	en:egg	en:honey, en:royal jelly, en:pollen, en:beeswax, en:marzipan, en:nougat, en:invertase	en:root vegetable, en:ginger, en:betanin, en:turmeric
Veganism	en:animal	en:poultry	en:game animal	en:fish, en:shellfish	en:dairy, en:lactose, en:milk minerals, en:milk proteins, en:milk chocolate, en:milk filling, en:ice cream, en:bechamel sauce	en:egg	en:honey, en:royal jelly, en:pollen, en:beeswax, en:marzipan, en:nougat, en:invertase	-
Lacto vegetarianism	en:animal	en:poultry	en:game animal	en:fish, en:shellfish	-	en:egg	-	-
Orthodox Fasting	en:animal	en:poultry	en:game animal	sometimes	en:dairy, en:lactose, en:milk minerals, en:milk proteins, en:milk chocolate, en:milk filling, en:ice cream, en:bechamel sauce	en:egg	-	-
Ovo vegetarianism	en:animal	en:poultry	en:game animal	en:fish, en:shellfish	en:dairy, en:lactose, en:milk minerals, en:milk proteins, en:milk chocolate, en:milk filling, en:ice cream, en:bechamel sauce	-	-	-
Ovo-lacto vegetarianism	en:animal	en:poultry	en:game animal	en:fish, en:shellfish	-	-	-	-
Pescetarianism	en:animal	en:poultry	en:game animal	-	sometimes	-	-	-
Pollo-vegetarianism	en:animal	-	en:game animal (yes)	en:fish, en:shellfish	-	-	-	-
Pollo-pescetarianism	en:animal	-	en:game animal (yes)	-	-	-	-	-

For some diets the game animal is permitted, so it is possible to include those ingredients.
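
As an illustration, two rows of the table can be turned into an exclusion filter. This is a sketch under assumptions: the child-to-parents mapping and the entries `en:chicken`, `en:salmon` and `en:rice` are hypothetical, and an ingredient that is missing from the taxonomy is simply treated as allowed, which a real application would probably not want.

```python
# Excluded entries per diet, copied from two rows of the table above.
DIET_EXCLUSIONS = {
    "Ovo-lacto vegetarianism": {
        "en:animal", "en:poultry", "en:game animal", "en:fish", "en:shellfish",
    },
    "Pescetarianism": {"en:animal", "en:poultry", "en:game animal"},
}

# Hypothetical excerpt of the "is a" relation: child -> list of parents.
PARENTS = {
    "en:chicken": ["en:poultry"],
    "en:salmon": ["en:fish"],
}


def lineage(ingredient, parents):
    """Return the ingredient itself plus all of its ancestors."""
    seen = set()
    stack = [ingredient]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen


def violations(ingredients, diet, parents=PARENTS):
    """Return the ingredients that fall under an excluded entry for the diet."""
    excluded = DIET_EXCLUSIONS[diet]
    return [i for i in ingredients if lineage(i, parents) & excluded]
```

For example, `violations(["en:salmon", "en:rice"], "Pescetarianism")` returns an empty list, while the same ingredients checked against "Ovo-lacto vegetarianism" return `["en:salmon"]`.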

Fodmap

Maintenance

This section describes how the taxonomy file should be maintained, i.e. adding ingredients, editing ingredients, etc.

Automatic add

Describes how new ingredients can be added automatically from the ingredients available in the OFF database. @stephane

Add single ingredient

An ingredient is a node in the taxonomy file, which describes a single ingredient as is found in an ingredient list.

Add ingredient name

You can add an ingredient anywhere in the file at first. Keep a blank line between the previous and next ingredient. Add the ingredient in the format: LANGUAGE_CODE:MAIN_INGREDIENT_NAME, SYNONYM, SYNONYM.

Thus start by defining the ingredient name you want to add. If there are multiple possible ingredient names, decide which one will be the MAIN_INGREDIENT_NAME. The other names can be added as SYNONYM. The ingredient names are separated by commas.

Determine the language code to be used. Find your language in this lang list and add the corresponding ISO 639-1 code as LANGUAGE_CODE to your MAIN_INGREDIENT_NAME, separated by a colon (:).
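
The format can be sketched with a small helper. The function name `format_entry` and the synonym "pureed strawberry" are made up for this illustration; the helper only joins the parts and does not check the language code.

```python
def format_entry(language_code, main_name, synonyms=()):
    """Build one line: LANGUAGE_CODE:MAIN_INGREDIENT_NAME, SYNONYM, SYNONYM."""
    names = [main_name, *synonyms]
    return f"{language_code}:" + ", ".join(names)


# An English entry with one synonym.
line = format_entry("en", "strawberry puree", ["pureed strawberry"])
```

Here `line` holds `en:strawberry puree, pureed strawberry`, with the main ingredient name first and the synonym after the comma.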

Add translations

You can add translations for an ingredient. Each translation should appear on a new line.

Wikipedia translations

If the ingredient exists as an article in Wikipedia, you can add the translations supplied by Wikipedia. You can use the language codes used by Wikipedia (as seen in the article URL). You should use the title of the articles, as this corresponds to the Wikipedia link and might include some disambiguation.

Warning: you cannot add the three (or more) letter codes used by Wikipedia. OFF does not support those. However, you might add them as comment lines, so we have them for the future.
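
A sketch of how Wikipedia titles could be turned into translation lines while keeping unsupported codes as comments. Two things are assumptions here: that a comment line starts with `#`, and that "supported" simply means a two-letter code. The titles in the usage example are placeholders, not verified article titles.

```python
def translation_lines(wikipedia_titles):
    """Turn {language_code: article_title} into taxonomy lines.

    Codes that are not two letters long are emitted as comment lines
    (assumed comment marker: '#') so they are kept for the future.
    """
    lines = []
    for code in sorted(wikipedia_titles):
        line = f"{code}:{wikipedia_titles[code]}"
        if len(code) == 2:
            lines.append(line)
        else:
            lines.append("#" + line)
    return lines


# Placeholder titles for a two-letter and a three-letter language code.
lines = translation_lines({"fr": "TITLE_FR", "fur": "TITLE_FUR", "de": "TITLE_DE"})
```

In this example the `de` and `fr` lines are emitted as regular translations, and the `fur` line is emitted as `#fur:TITLE_FUR`.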

Other translations

It is possible to add other languages through an online dictionary. Linguee might help you here if you speak multiple languages. Preferably, such translations should be checked by native speakers.

Untranslatable ingredients

Sometimes an item cannot be translated, as it simply does not exist in the other language. If possible, we could add a description instead.

Sort translations

The translations should be sorted in alphabetical order of the language code.

An exception is the first line: it holds the default language, which is used when no translation is present.
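
The sorting rule can be sketched as follows. The function name is hypothetical, and the sketch assumes every line has the form LANGUAGE_CODE:NAMES.

```python
def sort_translations(lines):
    """Keep the first (default language) line, sort the rest by language code."""
    if not lines:
        return []
    first, rest = lines[0], lines[1:]
    return [first] + sorted(rest, key=lambda line: line.split(":", 1)[0])


entry = ["en:strawberry", "nl:aardbei", "de:Erdbeere", "fr:fraise"]
sorted_entry = sort_translations(entry)
```

The English line stays on top because it is the default language; the remaining lines come out in the order de, fr, nl.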

Add corresponding wikipedia entry

On a new line you can add the Wikipedia article that corresponds to the ingredient. The format to use for this line is:
Wikipedia:https://LANGUAGE_CODE.wikipedia.org/wiki/MAIN_INGREDIENT_NAME
So be careful that your main ingredient name is indeed a wikipedia article. This does not always work out as such.

Add a parent ingredient

It is possible to link ingredients based on related properties. If you can say "this ingredient is a kind of that other ingredient", then you can probably assign the other ingredient as parent. First check whether the parent already exists. If not, add the parent ingredient to the ontology as well. Then add the parent to your ingredient in the format: <LANGUAGE_CODE:PARENT_INGREDIENT_NAME
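
Putting the line formats of this section together, a single entry could be read back with a small parser. This is a sketch under assumptions: the function name is hypothetical, comment lines are assumed to start with `#`, the position of the parent line within the entry is not prescribed by the text, and the example entry is illustrative only.

```python
def parse_entry(block):
    """Parse one taxonomy entry (the lines between two blank lines)."""
    entry = {"parents": [], "names": {}, "wikipedia": None, "comments": []}
    for raw in block.strip().splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):  # assumed comment marker
            entry["comments"].append(line[1:].strip())
        elif line.startswith("<"):  # parent: <LANGUAGE_CODE:PARENT_INGREDIENT_NAME
            entry["parents"].append(line[1:].strip())
        elif line.startswith("Wikipedia:"):
            entry["wikipedia"] = line[len("Wikipedia:"):]
        else:  # LANGUAGE_CODE:MAIN_INGREDIENT_NAME, SYNONYM, SYNONYM
            code, names = line.split(":", 1)
            entry["names"][code] = [n.strip() for n in names.split(",")]
    return entry


EXAMPLE = """
<en:strawberry
en:strawberry puree, pureed strawberry
fr:purée de fraise
"""

parsed = parse_entry(EXAMPLE)
```

After parsing, `parsed["parents"]` holds `en:strawberry`, and `parsed["names"]["en"]` lists the main name followed by its synonym.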

Issues

Language codes

OFF seems to use the short ISO 639-1 language codes. The consequence is that not all languages found on Wikipedia can be implemented in OFF. For instance, the language Furlan does not have an ISO 639-1 code, but does have an ISO 639-2 code (fur).

Disambiguation

How should disambiguation be handled?

Product Data

The current taxonomy is based on the strings found in the ingredients lists. This does not include the percentages, nor the organic markup. In addition for a better class assignment, other data from the product is useful (nutritional values, production place).

Scalability

First experiments with classification of raw individual axioms show that the reasoner breaks down. This suggests that at least the reasoner is not very scalable. I hope the solution is.

Comments

Stephane @ 2018-08-22

For the ingredients taxonomy, I think it's best if we take a pragmatic and incremental approach:

We currently have 2 uses for the ingredients taxonomy:

1. Compute the % of unknown ingredients so that we can tag products that have ingredient lists that are likely bogus. -> this is currently not very useful as we have tens of thousands of products tagged

2. Compute the NOVA score for transformed products. -> this is in production today, and displayed on the web site and the mobile apps. -> there is an increasing attention on NOVA

It's the NOVA classification that has really made the ingredients taxonomy a reality: it's something that is visible, and we can incrementally improve the ingredients taxonomy to improve the classification.

There is a 3rd use that I intend to develop soon:

3. Spellcheck and auto-correct the lists of ingredients. --> if we can make this work reasonably well, this will have a huge impact for OFF and help us to speed up adding new product data and rely more on OCR.

For those uses, we need:

- for 1 and 3, a list of correctly spelled ingredients, as comprehensive as possible

- for 2, a hierarchy of ingredients and their synonyms used in ingredients list, for ingredients and sub ingredients that are markers of transformed and ultra-transformed products.

There are many other potential uses for the ingredients taxonomy, and some of them require more features like relations, properties etc.

For instance if we want to automatically translate ingredient lists, we need the entries in the taxonomy to have synonyms. That's something we can do incrementally, and we could deploy partial translation.

If we want to automatically determine if a product is suitable for vegetarians or vegans, it's much more complex: we need to add properties to indicate if an ingredient can come from an animal source, and we need to do that for almost all ingredients: if there's 1 ingredient in the list of ingredients that we don't know about, then we can't determine anything. My preference is to try to take care of simple cases and impactful cases first, and as we do it, it will make the other uses easier to do later as well.

We can focus first on what has an immediate impact, then on what can have an impact soon. In the same line of thinking, I think we should try to find ways to do the most impactful things first.

The good thing is that it's easy to find those things: we just have to look at what we have in the ingredients lists of the products in the database. We sort them by frequency, and we work from the top down: we add them to the taxonomy if they are not there already, we add the parent ingredient if it has one, we try to identify synonyms and translations (using the ingredients lists from OFF) etc.
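
The frequency-first workflow could be sketched as follows. The function name, the sample ingredient lists and the set of known names are invented for this illustration; it assumes the ingredients lists have already been split into individual ingredient strings.

```python
from collections import Counter


def missing_by_frequency(ingredient_lists, known_names):
    """Return (ingredient, count) pairs not yet in the taxonomy, most frequent first."""
    counts = Counter(
        name.strip().lower()
        for ingredients in ingredient_lists
        for name in ingredients
    )
    return [(name, n) for name, n in counts.most_common() if name not in known_names]


# Invented sample data: three products and a tiny set of known entries.
lists = [
    ["sugar", "palm oil"],
    ["sugar", "cocoa butter"],
    ["Sugar", "palm oil", "salt"],
]
todo = missing_by_frequency(lists, known_names={"sugar", "salt"})
```

In this sample, "palm oil" (2 occurrences) comes before "cocoa butter" (1 occurrence), so it would be added to the taxonomy first.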

And as we do that, we can take note of potential issues, things that would be nice to support etc. But first let's focus on the general case. Then we can see how common the specific cases / issues are, what it would take to support them, what we could do if we supported them etc.

A few specific points raised in the discussions above:

Regarding parents: we should use them exclusively for the "is a" relationship. Not "contains", "derived from" etc. That means also that we don't add parents that are not actual ingredients themselves (e.g. no "milk derived ingredients" or "ingredients that contain milk"). Other relations can be added as properties, but before we do that and complicate the taxonomy, it would be good to think about the use cases, what they entail etc.

Regarding synonyms: we should list only synonyms that we have found in ingredient lists, not synonyms from other sources / contexts. That's an issue we have in the additives taxonomy: early on, we added tons of synonyms that are just never used, or used only in other contexts (like the chemical formulas). --> those extra synonyms now hurt us when we do auto-correction of the spelling of additives. The first synonym needs to be the most common one in ingredients lists.

Regarding the language of the canonical entry: let's just stick with English, it will make everything much easier.