Ingredients ontology
Introduction
Why?
Why do we need an ingredients ontology? The ontology describes how ingredients are derived from each other and how ingredients can be combined into new ingredients. An ontology might be useful to:
- Normalise ingredients - Producers take a lot of freedom in describing the ingredients they use. An ontology helps to standardise the ingredients.
- Hidden ingredients - an ingredient might contain hidden ingredients, the ontology might reveal these. For example butter contains butterfat.
- Combined ingredients - an ingredient might appear as a single ingredient. In reality, however, it consists of multiple ingredients.
- Processed ingredients - often an ingredient is derived from another ingredient through some process. We can make explicit what these processes are. For example, clarified butter is created from butter by separating the milk solids and water from the butterfat.
- Ingredient incompleteness - often an ingredient is incompletely defined in an ingredient list. For instance, if an ingredient list specifies milk, it should be defined which mammal the milk comes from, for instance cow's milk.
Ingredients Thesaurus
The current OFF ingredients taxonomy (cryptpad) can be seen as a thesaurus.
In the context of information retrieval, a thesaurus (plural: "thesauri") is a form of controlled vocabulary that seeks to dictate semantic manifestations of metadata in the indexing of content objects. A thesaurus serves to minimise semantic ambiguity by ensuring uniformity and consistency in the storage and retrieval of the manifestations of content objects. ANSI/NISO Z39.19-2005 defines a content object as "any item that is to be described for inclusion in an information retrieval system, website, or other source of information". The thesaurus aids the assignment of preferred terms to convey semantic metadata associated with the content object. (wikipedia)
The purpose of the thesaurus is to have a list of the ingredients that occur in ingredient lists, each of which is unique and well defined. The OFF ingredients taxonomy is mainly a list of elements that occur in ingredient lists. However, the OFF ingredients taxonomy is more than a simple thesaurus.
Example
Below is the ingredient class entry of soybean in the OFF ingredients taxonomy:
# SOYBEAN (Glycine max), or soya bean is a species of legume native to East Asia,
<en:soya
en:soya bean, soy beans, soya beans, soybean
da:Sojabønne
de:Sojabohne, Sojabohnen, Sojakerne
es:granos de soja, habas de soja
et:Sojauba
fi:Soijapapu
fr:fève de soja, fèves de soja, soja entier, graine de soja, graines de soja, graines soja
Explanation
- Definition - the # describes a comment line and can be used to add a definition of the ingredient. In this case the definition is taken from wikipedia.
- Super-ingredient - the <en:soya line describes the super-ingredient of this (sub-ingredient) ingredient. The relationship between the super-ingredient and the sub-ingredient is an is-a-kind-of relationship. This means that the sub-ingredient is a kind of super-ingredient. In this example the soybean ingredient is a kind of soy ingredient. The sub-ingredient provides more details of the ingredient. In this case the super-ingredient is very broad (something with soy), whereas the sub-ingredient adds the detail of the bean. This line is optional and can be left out if no super-ingredient can be specified.
- Key - The en:soya bean line is the main name (key) of the ingredient. This doubles as name for the ingredient (in this case in english (en:)).
- Translations - The next lines provide translations of the main ingredient in other languages. One language per line. Each line starts with a language prefix. Thus de: means german. This language prefix is the ISO 639-1 code.
- Synonyms - an ingredient in a language can have multiple synonyms. These synonyms appear as secondary entries on a language line. The first entry is however the main name in that language.
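The line types explained above can be read with a few lines of Python. This is only a minimal sketch: the `Entry` class and the `parse_entry` function are hypothetical names, not part of any OFF tooling, and the sketch assumes one entry per block with exactly the three line types shown (comment, super-ingredient, language line).

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    comments: list = field(default_factory=list)  # text of lines starting with '#'
    parents: list = field(default_factory=list)   # keys from lines starting with '<'
    names: dict = field(default_factory=dict)     # language code -> [main name, synonyms...]


def parse_entry(text):
    """Parse one taxonomy entry (a block of lines without blank lines)."""
    entry = Entry()
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("#"):
            entry.comments.append(line[1:].strip())
        elif line.startswith("<"):
            entry.parents.append(line[1:].strip())
        else:
            # Language line: prefix before the first colon, names separated by commas.
            lang, _, names = line.partition(":")
            entry.names[lang.strip()] = [n.strip() for n in names.split(",") if n.strip()]
    return entry


example = """\
# SOYBEAN (Glycine max), or soya bean is a species of legume native to East Asia,
<en:soya
en:soya bean, soy beans, soya beans, soybean
da:Sojabønne
de:Sojabohne, Sojabohnen, Sojakerne
"""

entry = parse_entry(example)
print(entry.parents)          # ['en:soya']
print(entry.names["en"][0])   # soya bean
print(entry.names["de"][1:])  # ['Sojabohnen', 'Sojakerne']
```

The first name on each language line is kept at index 0, so the main name and the synonyms stay distinguishable.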
Multiple super ingredients
It is possible to specify multiple super ingredients for an ingredient. For instance:
<en:soya
<en:sauce
en:soy sauce
de:Sojasauce
In this case en:soy sauce is a kind of en:sauce and a kind of en:soya.
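With multiple super-ingredients the taxonomy becomes a graph rather than a tree. The sketch below collects everything an ingredient is "a kind of", directly or indirectly. The `parents` table and the `ancestors` function are hypothetical and only illustrate the idea.

```python
def ancestors(key, parents):
    """Return every ingredient that `key` is a kind of, directly or indirectly."""
    found = set()
    stack = list(parents.get(key, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, []))
    return found


# Hypothetical parent table: child key -> list of super-ingredient keys.
parents = {
    "en:soy sauce": ["en:soya", "en:sauce"],
    "en:soya bean": ["en:soya"],
}

print(sorted(ancestors("en:soy sauce", parents)))  # ['en:sauce', 'en:soya']
```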
Compound ingredients
It is possible to have an ingredient that in reality consists of multiple ingredients. For instance fr:jus de soja can also be specified as en:water plus en:soybean. On ingredient lists this usually appears as jus de soja (eau, fèves de soja), i.e. the elements between parentheses define the real ingredients.
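A compound entry of this shape can be split into its name and its sub-ingredients with a regular expression. This is a sketch under the assumption that there is a single, non-nested pair of parentheses. `split_compound` is a hypothetical helper name.

```python
import re

# name, optional whitespace, then one non-nested parenthesised list
COMPOUND = re.compile(r"^\s*(?P<name>[^()]+?)\s*\((?P<subs>[^()]*)\)\s*$")


def split_compound(raw):
    """Split 'name (a, b)' into (name, [a, b]); plain ingredients get an empty list."""
    match = COMPOUND.match(raw)
    if not match:
        return raw.strip(), []
    subs = [s.strip() for s in match.group("subs").split(",") if s.strip()]
    return match.group("name"), subs


print(split_compound("jus de soja (eau, fèves de soja)"))
# ('jus de soja', ['eau', 'fèves de soja'])
print(split_compound("eau"))
# ('eau', [])
```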
Specification
Producers sometimes try to shorten ingredient lists by suppressing repetition. For instance the entry vegetable fats (palm, sunflower) should actually be read as two elements: palm vegetable fat and sunflower vegetable fat. In this case the parentheses act as a method to avoid repetition.
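Read this way, the parentheses expand into one element per qualifier. The sketch below shows that expansion. `expand_specification` is a hypothetical name, the plural of the base term is kept as written (no singularisation), and deciding whether parentheses mean a specification or a compound ingredient is left to the caller.

```python
def expand_specification(raw):
    """Expand 'base (q1, q2)' into ['q1 base', 'q2 base']."""
    base, paren, rest = raw.partition("(")
    if not paren:
        return [raw.strip()]
    qualifiers = [q.strip() for q in rest.rstrip().rstrip(")").split(",") if q.strip()]
    return [f"{q} {base.strip()}" for q in qualifiers]


print(expand_specification("vegetable fats (palm, sunflower)"))
# ['palm vegetable fats', 'sunflower vegetable fats']
```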
Practice
Some experiences have been gathered in building the OFF taxonomy. These can be divided into guiding principles and steps.
Guiding principles
- Ingredient list guides - the entries as found on ingredient lists are the basis and guide the taxonomy. So no entries from other places. (an exception is translation bootstrapping)
- Primary language - the first language of an ingredient should be in english, if it is available in that language. Otherwise use the language where the ingredient occurs the most (at the moment that will probably be french);
- Singular
- Lower case
- No assumption - if in doubt about a translation, an assignment or anything else, keep the ingredient separate. Make no assumption about an ingredient. If it is specified vaguely, keep it vague;
- Or's - sometimes an ingredient is listed as X or Y. This will be entered as <en:X and <en:Y, so there are two superclasses.
Steps
To build the taxonomy from raw data, the following steps need to be taken:
- Gather - the first step is gathering the ingredients from the ingredient lists. This results in a list of raw ingredients. This process still needs to be described (@stephane does this);
- Assign - the raw ingredients can be mapped onto the existing taxonomy, in order to find ingredients that are new and not yet part of the taxonomy (the new raw ingredients);
- Merge - each new raw ingredient must be looked at and if possible merged with an ingredient in the existing taxonomy. This also involves merging translations;
- Key - if a new raw ingredient can be mapped to the taxonomy, it will probably be a new synonym of an existing entry. Ideally one should check which of the synonyms occurs the most and set that one as main ingredient name;
- Create - if the ingredient is totally new, a new entry can be created. It should be decided what the super ingredient is (if any). It can then be entered in that part of the taxonomy.
- Translate - try to find the translations of the ingredient, for example by searching for translations in OFF. For instance, the page for fr:polydextrose shows the languages available; try to use the translations offered there;
- Define - add a definition from wikipedia and wikidata if available.
- OFF - add a link to the corresponding ingredients page. For instance for fr:polydextrose.
- Occurrences - note how often the ingredient occurs, in how many languages and on what date that was determined. This might require an advanced search (polydextrose). This might allow us to track if changes are necessary;
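The Assign step above can be sketched as a lookup of each raw ingredient in an index of all known names. The data layout and the names `build_index` and `assign` are assumptions made for this illustration, not the actual OFF implementation.

```python
def build_index(taxonomy):
    """Map every known name (main name or synonym) to its entry key."""
    index = {}
    for key, names_by_lang in taxonomy.items():
        for lang, names in names_by_lang.items():
            for name in names:
                index[f"{lang}:{name.lower()}"] = key
    return index


def assign(raw_ingredients, index):
    """Split raw ingredients into those already known and those that are new."""
    known, new = {}, []
    for raw in raw_ingredients:
        key = index.get(raw.lower())
        if key is None:
            new.append(raw)
        else:
            known[raw] = key
    return known, new


# Hypothetical miniature taxonomy: entry key -> {language: [main name, synonyms...]}.
taxonomy = {
    "en:soya bean": {
        "en": ["soya bean", "soy beans", "soybean"],
        "de": ["Sojabohne", "Sojabohnen"],
    },
}

known, new = assign(["de:Sojabohnen", "en:soybean", "fr:polydextrose"], build_index(taxonomy))
print(known)  # {'de:Sojabohnen': 'en:soya bean', 'en:soybean': 'en:soya bean'}
print(new)    # ['fr:polydextrose']
```

The entries in `new` are the candidates for the Merge, Key and Create steps.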
Watch out
It is possible to make mistakes in these steps.
- Wrong translation - the ingredient lists in different languages available on products do not always offer a translation.
- Translation bootstrapping - in order to get up to speed with the translations, we use Wikipedia and Wikidata to find translations. This can result in wrong translations, as the result can be something different from what is found on the package. So be careful with wikipedia disambiguation strings, latin species names, etc. In the long run these should be superseded by what is found on the products.
Theory
What theory can be used to base a food ingredients taxonomy on? Is there already a food ontology somewhere?
OWL
I get the impression that OWL is the most accepted markup language to exchange semantic networks. I will use the Manchester markup to describe the axioms.
Tools
Discovering Protégé from Stanford at the moment. There are standalone versions and web-versions of the app. Maybe the web-version can be used to maintain the ontology by OFF-users. Unfortunately the web-version does not support all the detailed modelling options.
Existing ontologies
It would be a pity if we have to develop ontologies for all the
Languages
Several language ontologies exist:
- Lingvoj seems to be a languages ontology, which is available in rdf. It is not clear how to download the individual language files.
- The EU has an XML and SKOS download (Language NAL).
- Lexvo (RDF-file cannot be loaded by Protégé)
These solutions seem to be overkill for our needs.
Food
Can we use LanguaL in some way? Maybe for inspiration.
DBPedia
This should provide links to wikipedia, short summaries and be multilingual.
Geography
This is needed to get some structure for locations (useful for IGP).
Labels
Are there taxonomies/ontologies for labels?
Issues
It is not easy to convert all aspects of OFF to OWL and Protégé. Some issues that have been encountered:
- Translations
- Inheritance
Ontology building
This section describes how the ontology is created starting from the ingredients found on the ingredient lists of products.
Individuals
The purpose of the ontology is to have a set of axioms between canonical ingredients, which is logically consistent. If a producer puts an ingredient in an ingredient list (the raw ingredient), it will first be normalised to a single canonical ingredient name. This canonical name will define the ingredient class. These canonical ingredient names are the individuals in OWL/Protégé terminology.
Raw ingredients
The creation of the ontology starts with the ingredients as found on the ingredient lists. A raw ingredient is defined as the text between two ingredient dividers (where are these defined?). This might include any percentage, organic markup, allergen markup, origin and sub-ingredients. In combination with the language used, it gives a first indication of the ingredient class.
Normalized ingredients
The raw ingredients must be transformed to normalized ingredients. A normalized ingredient is a single string in a western script (this assumes that the western script can be read by anyone in the world). A normalized ingredient is preceded by the language code of the ingredient. Usually this is english, if the translation is available. (It is possible that the source language is needed for a better comprehension when classifying ingredients).
Normalizing ingredients implies that any variant of an ingredient in any language is mapped onto a single name. The normalizing procedure consists mainly of pattern recognition.
For Protégé this implies that any normalized ingredient individual must have a data property, which defines the data corresponding to the individual. We encode this as:
Individual: en:Butter
    Facts: hasName "en:butter"^^xsd:string
Probably it is possible to bypass this normalisation step, but that makes the encoding in OWL much more extended.
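The mapping of raw variants onto a single canonical name can be sketched as a clean-up followed by a lookup. Everything here is an assumption made for illustration: the function names, the characters treated as organic or allergen markup (`*` and `_`), and the tiny `variants` table.

```python
import re

PERCENTAGE = re.compile(r"\d+(?:[.,]\d+)?\s*%")
MARKUP = re.compile(r"[*_]")  # assumed markers for organic / allergen markup


def clean_raw(raw):
    """Strip percentages and markup characters, collapse whitespace, lower-case."""
    text = PERCENTAGE.sub("", raw)
    text = MARKUP.sub("", text)
    return " ".join(text.split()).lower()


def normalise(raw, lang, variants):
    """Return the canonical name for a raw ingredient, or None if it is unknown."""
    return variants.get(f"{lang}:{clean_raw(raw)}")


# Hypothetical variant table: language-prefixed variant -> canonical name.
variants = {"en:butter": "en:butter", "fr:beurre": "en:butter"}

print(clean_raw("  Butter* 82,5 % "))             # butter
print(normalise("Beurre* 82 %", "fr", variants))  # en:butter
print(normalise("ghee", "en", variants))          # None
```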
Normalisation procedure
- keep and show - try to keep the original entry as much as possible (remove typos, etc.) and show the user what the canonical variant is.
- plural / singular - should all ingredients be defined in singular or plural? It probably depends on what is meant (or left out). If a raw ingredient says minerals or salts, it does not say which mineral or salt. There can be more than one mineral or salt. So in the normalisation plurals should be kept.
- split/combine - it is tempting to combine ingredient names that point to the same ingredient. Whether one should do so depends on practical usage of the names. If both names are used equally often, it is better to keep them separate. If one name is obscure, let it be a synonym of the more prevalent name. For instance butterfat and ghee are used equally often and should be kept separate, although they are the same ingredient.
- ingredient/product - ingredient lists can contain products, or better, compound ingredients. A compound ingredient consists of multiple sub-ingredients. This can be listed as: compound ingredient (subingredient1, subingredient2, ...) or compound ingredient: subingredient1, subingredient2, ...;. This structure should be kept intact, so a better normalisation can be done.
- translations - for each canonical ingredient-name, translations in other languages might exist. The best is to deduce the translations from the actual ingredient lists. However be careful, the producer might have taken some liberties with these translations. If you think the translation is faulty do not add it. Create two different canonical ingredients and have them point to each other.
- explanatory translations - it is tempting to add a description if the translation does not exist. Such an explanatory translation can be part of a class annotation.
Issues
Things we do not understand or are a design issue.
- Individual or class - should the canonical ingredient be seen as an individual or a class? The normalisation can be seen as a method to classify all ingredient names. Each ingredient name, e.g. en:butter or fr:beurre, is an individual, which is now mapped onto a single individual en:butter. It all depends on how an individual is mapped onto a class. I assume for now that the inference engine does the parsing of the canonical ingredient name. For each canonical name we could define:
Individual: en:Butter
    Types: Butter
Individual: en:Butter
    Types: English
Ingredient Classes
The classes define the Ingredients, Languages and other concepts used in the ontology.
Named Ingredient Classes
The ingredient classes are the sets that represent the individuals. Each normalised individual will be mapped to a single class. These are the atomic classes (the named classes)
* URI
Each class has a unique URI. We use the canonical ingredient name for this URI plus a namespace prefix, for example: OFF_Named_Butter or OFF_Named_Beurre_Doux (the adjectives are placed behind).
* Necessary conditions
Each ingredient class has a necessary and sufficient condition (equivalent class expression) consisting of a single data property (one can also specify multiple data properties here if one does not want to do the normalisation):
Class: Butter
    EquivalentTo: hasName some string[pattern "en:butter"]
* Annotations
A class can further be described by annotations, such as:
- label - a user friendly name of the class, probably the canonical name of the ingredient with a corresponding language code. It is possible to add names for each language. These labels can then serve as translation strings.
- wikipedia - a link to the corresponding ingredient on Wikipedia. Preferably this link is in english, but links in other languages can be added as well.
- wikidata - the link to the english wikidata entry.
- openFoodFacts ingredients - a link to the corresponding world ingredient page on OFF.
- comment - a comment in english can be added to explain how the ingredient is defined.
* Axioms
Each named ingredient should have one or more associated properties. These properties uniquely define the ingredient in addition to just the name. This definition can be used to show how it differs from other ingredients and is the basis for the Defined Ingredients. This definition must be based on information gathered elsewhere, such as wikipedia, wikidata or legislation.
It should be attempted to define a Named Ingredient with as few properties as possible, in order to reduce the complexity. The definition can be in the form of:
- hasPercentage axioms - these axioms define the amount of a single component of an ingredient. If needed multiple hasPercentage axioms can be combined.
Class: Butter
    Facts: hasPercentageOfButterfat: 80
The official definition for butter is that it must have at least 80% butterfat, so this property is enough to define the butters. It might be necessary to add other types of axioms.
Defined Ingredient Classes
It is possible to group Named Ingredient classes into other classes, in order to define a set of similar ingredients. This can be done by combining Named Ingredient classes by hand. For instance:
Class: OFF-Butterfat
    EquivalentTo: { Butterfat, Ghee, Concentrated Butter, Clarified Butter }
This approach seems however to be a bit arbitrary. Why are these ingredients combined? What is their commonality? It would be nice to define a set of axioms that can combine these automatically.
The official definition of butterfat (> 99% butterfat) helps to define the class of butterfats:
Class: OFF-Butterfat
    Facts: hasPercentageOfButterfat: >= 99
Note that the class name is OFF-Butterfat; this helps to distinguish it from the Named Ingredient Butterfat.
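The effect of such percentage axioms can be imitated outside a reasoner with a simple threshold table. This is only a sketch: the thresholds are the 80 and 99 mentioned in the text, `classes_for` is a hypothetical name, and whether a butterfat should also count as a butter is an assumption the text does not settle.

```python
# Minimum butterfat percentage per class, taken from the two axioms in the text.
THRESHOLDS = {"Butter": 80, "OFF-Butterfat": 99}


def classes_for(butterfat_percentage):
    """Return every class whose minimum butterfat percentage is met."""
    return [name for name, minimum in THRESHOLDS.items() if butterfat_percentage >= minimum]


print(classes_for(82))    # ['Butter']
print(classes_for(99.8))  # ['Butter', 'OFF-Butterfat']
print(classes_for(40))    # []
```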
Object properties
These object properties define what we know about the relation between the classes. These object properties are defined independent of the classes (that is a base idea in OWL). These properties can help the user see how ingredients are related. These relations can be divided into several categories:
- isDerived to show how one gets from one ingredient to the next;
- hasAdded to show if another ingredient has been added (not sure what the difference with isDerived is);
- isRemoved to show if another ingredient has been removed;
- contains - to indicate the relation with other classes:
Class: Butter
    SubClassOf: contains only Butterfat
It is possible to come up with many different object properties. The question is when they are useful and when not.
* Domain / Range
As these object properties are defined outside the classes, it is not clear upfront to which classes they apply. The use of domains and ranges helps with this. The domain specifies the subject and the range the object, for instance: Butter isDerivedFrom Butterfat. As we will have multiple isDerivedFrom object properties, it is better to name these in full, like:
Class: Butter ButterIsDerivedFromButterfat: Butterfat
I gave up on the domain/range approach, as it is a lot of work. Instead I add the object properties to the class directly. No inference is needed then.
Class: Butter
    SubClassOf: ( isDerivedFrom only Butterfat ) or ( isDerivedFrom only Cream )
In this complex expression one could specifically add the origin for each Named Ingredient Class.
* Inverse relationships
Not sure if these need to be defined. Maybe if the user should be able to go in both directions.
General axioms
Not sure yet how to use these.
Reasoner
The tool Protégé has a reasoner, which checks the correctness of the ontology. It should be applied regularly in order to check for any errors.
Language Classes
For each language a separate class can be defined. The name of the class, English Language for instance, is in english and makes clear what the class is about. The annotation data for the language can be taken from Lingvoj by hand. Lingvoj has each language defined as a separate individual. We want to assign each ingredient individual to a specific language class. Each language class is a direct subclass of the superclass Language.
Each language class has a data property, which is a necessary and sufficient condition that determines its membership. This data property describes how the language is encoded in an ingredient individual. This encoding consists of the language code that corresponds to the language. For example, for an ingredient individual in english:
Class: English
    EquivalentTo: hasName some string[pattern "en:.*"]
This uses a regex-string for decoding.
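The same regex-based membership test can be tried out in Python. XML Schema patterns must match the whole string, so `fullmatch` is the closest equivalent. The class names and the `language_classes` function are hypothetical and only mirror the idea.

```python
import re

# One pattern per language class, following the "en:.*" convention above.
LANGUAGE_PATTERNS = {
    "English": re.compile(r"en:.*"),
    "French": re.compile(r"fr:.*"),
}


def language_classes(has_name):
    """Return the language classes whose pattern matches the hasName value."""
    return [cls for cls, pattern in LANGUAGE_PATTERNS.items() if pattern.fullmatch(has_name)]


print(language_classes("en:butter"))  # ['English']
print(language_classes("fr:beurre"))  # ['French']
print(language_classes("butter"))     # []
```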
Origin classes
These classes define the original source of an ingredient, before any processing is done. For instance milk comes from female mammals, cow milk comes from female cows, salt comes from the mineral halite, etc.
The food thesaurus created by LanguaL can be used as a base.
Label classes
These classes describe labels that have been assigned to Ingredients.
Geography Classes
These classes describe the geographic origin of classes. These can be added as necessity requires. A hierarchy based on continent and country or region can be added.
Example
The first attempts at an ontology can be found on Github.
Maybe I can make a drawing of a part of the ontology.
Ontology Usage
The goal is to obtain an ontology that can be used by OFF to analyse the ingredients in an ingredient list. One could envisage the following steps:
- The ingredient instance/entry/individual is entered into the inference engine
- The engine infers an Ingredient Class and its subclasses
- The results can be presented to the user and will explain what the class is, how it is related to other Ingredient Classes, how the Ingredient Class is created from other classes and it will show what the ingredient entry does not tell.
- The user can select the language in which they want to see the results.
Categories Taxonomy
OFF maintains a categories taxonomy to categorise products. This categories taxonomy is related to the ingredient ontology, but subtly different. First a product is something that is on sale and has an identifier (barcode). A product will never be part of an ingredients list. You will never see a barcode in the ingredients.
The relationship between product categories and ingredients can be described as: A product with a product name and barcode from a brand belongs to a category and has one or more ingredients.
For example the product Beurre Gastronomique Doux of the brand Milbona with barcode 20139315 has one ingredient: Beurre pasteurisé and belongs to the Sweet cream butters category.
Allergens taxonomy
OFF uses an allergens taxonomy/thesaurus to detect/describe allergens. We need to indicate the relation with this ontology.
Maintenance
This section describes how the taxonomy file should be maintained, i.e. adding ingredients, editing ingredients, etc.
Automatic add
Describes how automatically new ingredients can be added from the ingredients available in the OFF-database. @stephane
Add single ingredient
An ingredient is a node in the taxonomy file, which describes a single ingredient as is found in an ingredient list.
Add ingredient name
You can add an ingredient anywhere in the file at first. Keep a blank line between the previous and next ingredient. Add the ingredient in the format: LANGUAGE_CODE:MAIN_INGREDIENT_NAME, SYNONYM, SYNONYM.
Thus start by defining the ingredient name you want to add. If multiple ingredient names are possible, decide which ingredient name will be the MAIN_INGREDIENT_NAME. The other names can be added as SYNONYM. The ingredient names are separated by commas.
Determine the language code to be used. Find your language in this lang list and add the corresponding ISO 639-1 code as LANGUAGE_CODE to your MAIN_INGREDIENT_NAME, separated by a colon (:).
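Building such a line can be sketched as follows. `format_name_line` is a hypothetical helper, and the check for a two-letter code is a simplification: it tests only the shape of the code, not whether it is an assigned ISO 639-1 code.

```python
def format_name_line(language_code, main_name, synonyms=()):
    """Build a line of the form LANGUAGE_CODE:MAIN_INGREDIENT_NAME, SYNONYM, SYNONYM."""
    code = language_code.lower()
    # Shape check only: two ASCII letters, as ISO 639-1 codes have.
    if not (len(code) == 2 and code.isascii() and code.isalpha()):
        raise ValueError(f"not a two-letter language code: {language_code!r}")
    return f"{code}:" + ", ".join([main_name, *synonyms])


print(format_name_line("en", "soya bean", ["soy beans", "soybean"]))
# en:soya bean, soy beans, soybean
```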
Add translations
You can add translations for an ingredient. Each translation should appear on a new line.
- Wikipedia translations
If the ingredient exists as an article in Wikipedia, you can add the translations supplied by Wikipedia. You can use the language codes used by Wikipedia (as seen in the article url). You should use the title of the articles, as this corresponds to the wikipedia link and might include some disambiguation.
Warning: you cannot add the three (or more) letter codes used by Wikipedia. OFF does not support those. However you might add them as comment lines, so we have them for the future.
- Other translations
It is possible to add other languages through an online dictionary. Linguee might help you here if you speak multiple languages. But preferably this should be checked by native speakers.
- Untranslatable ingredients
Sometimes an item can not be translated, as it just does not exist. If possible we could add a description instead.
Sort translations
The translations should be sorted in alphabetical order of the language code.
An exception is the first line, that is the default language, which is used, when no translation is present.
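The sorting rule, including the exception for the first line, can be sketched like this. `sort_translations` is a hypothetical helper that works on the language lines of a single entry.

```python
def sort_translations(lines):
    """Keep the first (default language) line, sort the rest by language code."""
    if not lines:
        return []
    first, rest = lines[0], lines[1:]
    return [first] + sorted(rest, key=lambda line: line.partition(":")[0])


lines = [
    "en:soya bean, soybean",
    "fr:fève de soja",
    "de:Sojabohne",
    "da:Sojabønne",
]
print(sort_translations(lines))
# ['en:soya bean, soybean', 'da:Sojabønne', 'de:Sojabohne', 'fr:fève de soja']
```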
Add corresponding wikipedia entry
On a new line you can add the wikipedia article that corresponds to the ingredient. The format to use for this line is:
Wikipedia:https://LANGUAGE_CODE.wikipedia.org/wiki/MAIN_INGREDIENT_NAME
So be careful that your main ingredient name is indeed a wikipedia article. This does not always work out as such.
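The line can be generated from the language code and the article title. In this sketch `wikipedia_line` is a hypothetical helper. Replacing spaces with underscores follows the way Wikipedia writes article URLs; the taxonomy format itself does not state this. Whether the article exists still has to be checked by hand.

```python
def wikipedia_line(language_code, article_title):
    """Build the Wikipedia line for an entry from a language code and article title."""
    path = article_title.strip().replace(" ", "_")
    return f"Wikipedia:https://{language_code}.wikipedia.org/wiki/{path}"


print(wikipedia_line("en", "Soy sauce"))
# Wikipedia:https://en.wikipedia.org/wiki/Soy_sauce
```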
Add a parent ingredient
It is possible to link ingredients based on related properties. If you can answer the question "Is the ingredient a kind of the other ingredient?" with yes, then you can probably assign that other ingredient as parent. First search whether it already exists. If not, you should add the parent ingredient to the ontology as well. Add the parent ingredient in the format:
<LANGUAGE_CODE:PARENT_INGREDIENT_NAME
Issues
Language codes
OFF seems to use the short ISO 639-1 language codes. The consequence is that not all languages that are found on Wikipedia can be implemented in OFF. For instance the language Furlan does not have an ISO 639-1 code, but does have an ISO 639-2 code (fur).
Disambiguation
How should disambiguation be handled?
Product Data
The current taxonomy is based on the strings found in the ingredients lists. This does not include the percentages, nor the organic markup. In addition for a better class assignment, other data from the product is useful (nutritional values, production place).
Scalability
First experiments with classification of raw individual axioms show that the reasoner breaks down. This suggests that at least the reasoner is not very scalable. I hope the solution is.
Comments
Stephane @ 2018-08-22
For the ingredients taxonomy, I think it's best if we take a pragmatic and incremental approach:
We currently have 2 uses for the ingredients taxonomy:
1. Compute the % of unknown ingredients so that we can tag products that have ingredient lists that are likely bogus. -> this is currently not very useful as we have tens of thousands of products tagged
2. Compute the NOVA score for transformed products. -> this is in production today, and displayed on the web site and the mobile apps. -> there is an increasing attention on NOVA
It's the NOVA classification that has really made the ingredients taxonomy a reality: it's something that is visible, and we can incrementally improve the ingredients taxonomy to improve the classification.
There is a 3rd use that I intend to develop soon:
3. Spellcheck and auto-correct the lists of ingredients. --> if we can make this work reasonably well, this will have a huge impact for OFF and help us to speed adding new products data and rely more on OCR.
For those uses, we need:
- for 1 and 3, a list of correctly spelled ingredients, as comprehensive as possible
- for 2, a hierarchy of ingredients and their synonyms used in ingredients list, for ingredients and sub ingredients that are markers of transformed and ultra-transformed products.
There are many other potential uses for the ingredients taxonomy, and some of them require more features like relations, properties etc.
For instance if we want to automatically translate ingredient lists, we need the entries in the taxonomy to have synonyms. That's something we can do incrementally, and we could deploy partial translation.
If we want to automatically determine if a product is suitable for vegetarians or vegans, it's much more complex: we need to add properties to indicate if an ingredient can come from an animal source, and we need to do that for almost all ingredients: if there's 1 ingredient in the list of ingredients that we don't know about, then we can't determine anything. My preference is to try to take care of simple cases and impactful cases first, and as we do it, it will make the other uses easier to do later as well.
We can focus first on what has an immediate impact, then on what can have an impact soon. In the same line of thinking, I think we should try to find ways to do the most impactful things first.
The good thing is that it's easy to find those things: we just have to look at what we have in the ingredients lists of the products in the database. We sort them by frequency, and we work from the top down: we add them to the taxonomy if they are not there already, we add the parent ingredient if it has one, we try to identify synonyms and translations (using the ingredients lists from OFF) etc.
And as we do that, we can take note of potential issues, things that would be nice to support etc. But first let's focus on the general case. Then we can see how common the specific cases / issues are, what it would take to support them, what we could do if we supported them etc.
A few specific points raised in the discussions above:
Regarding parents: we should use them exclusively for the "is a" relationship. Not "contains", "derived from" etc. That means also that we don't add parents that are not actual ingredients themselves (e.g. no "milk derived ingredients" or "ingredients that contain milk"). Other relations can be added as properties, but before we do that and complexify the taxonomy, it would be good to think about the use cases, what they entail etc.
Regarding synonyms: we should list only synonyms that we have found in ingredient lists, not synonyms from other sources / contexts. That's an issue we have in the additives taxonomy: early on, we added tons of synonyms that are just never used, or used only in other contexts (like the chemical formulas). --> those extra synonyms now hurt us when we do auto-correction of the spelling of additives. The first synonym needs to be the most common one in ingredients lists.
Regarding the language of the canonical entry: let's just stick with English, it will make everything much easier.
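As an illustration of the frequency-driven approach and of use 1 from the comment above, the sketch below ranks ingredients that are missing from the taxonomy and computes the percentage of unknown ingredients for one product. It is not part of the comment and not the OFF implementation. The function names and the sample data are hypothetical, and the ingredient lists are assumed to be already split and normalised.

```python
from collections import Counter


def missing_by_frequency(ingredient_lists, known):
    """Return (ingredient, count) pairs not in the taxonomy, most frequent first."""
    counts = Counter(name for ingredients in ingredient_lists for name in ingredients)
    return [(name, count) for name, count in counts.most_common() if name not in known]


def unknown_percentage(ingredients, known):
    """Percentage of a product's ingredients that are not in the taxonomy."""
    if not ingredients:
        return 0.0
    unknown = sum(1 for name in ingredients if name not in known)
    return 100.0 * unknown / len(ingredients)


# Hypothetical sample data.
known = {"sugar", "water"}
ingredient_lists = [
    ["sugar", "water", "polydextrose"],
    ["water", "polydextrose"],
    ["sugar", "xanthan"],
]

print(missing_by_frequency(ingredient_lists, known))
# [('polydextrose', 2), ('xanthan', 1)]
print(unknown_percentage(["sugar", "water", "polydextrose", "xanthan"], known))
# 50.0
```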