Ingredients ontology

From Open Food Facts wiki

Introduction

Why?

Why do we need an ingredients ontology? The ontology describes how ingredients are derived from each other and how ingredients can be combined into new ingredients. An ontology might be useful to:

  • Normalise ingredients - Producers take a lot of freedom in describing the ingredients they use. An ontology helps to standardise the ingredients.
  • Hidden ingredients - an ingredient might contain hidden ingredients, which the ontology can reveal. For example, butter contains butterfat.
  • Combined ingredients - an ingredient might appear as a single ingredient. In reality, however, it may consist of several ingredients.
  • Processed ingredients - often an ingredient is derived from another ingredient through some process. We can make explicit what these processes are. For example, clarified butter is created from butter by separating the milk solids and water from the butterfat.
  • Ingredient incompleteness - often an ingredient is incompletely defined in an ingredient list. For instance, if an ingredient list specifies milk, it should also state which mammal the milk comes from, for instance cow's milk.

Theory

What theory can a food ingredients taxonomy be based on? Is there already a food ontology somewhere?

OWL

I get the impression that OWL is the most accepted markup language to exchange semantic networks.

Tools

Discovering Protégé from Stanford at the moment. There are standalone versions and web-versions of the app. Maybe the web-version can be used to maintain the ontology by OFF-users.

Existing ontologies

It would be a pity if we have to develop ontologies for all the

Languages

Several language ontologies exist:

  • Lingvoj seems to be a languages ontology, which is available in RDF. It is not clear how to download the individual language files.
  • The EU has an XML and SKOS download (Language NAL).
  • Lexvo (the RDF file cannot be loaded by Protégé)

These solutions seem to be overkill for our needs.

Food

Can we use LanguaL in some way? Maybe for inspiration.

DBpedia

This should provide links to Wikipedia and short summaries, and be multilingual.

Geography

This is needed to get some structure for locations (useful for IGP).

Labels

Are there taxonomies/ontologies for labels?

Issues

It is not easy to convert all aspects of OFF to OWL and Protégé. Some issues that have been encountered:

  • Translations
  • Inheritance

Ontology building

This section describes how the ontology is put together.

Individuals

The purpose of the ontology is to define a set of relationships between standardised ingredients. Thus, if a producer defines an ingredient in an ingredient list (a raw ingredient), it is first normalised to a single language and format, and then it can be inferred which ingredient it is. If an ingredient was only defined in vague terms, the relationships will indicate the missing information and the assumptions that can be made about the ingredient.

Raw ingredients

The creation of the ontology starts with the ingredients as found on the ingredient lists. A raw ingredient is defined as the text between two ingredient dividers (where are these defined?). This might include any percentage, organic markup, allergen markup, origin and sub-ingredients. In combination with the language used, it gives a first indication of the ingredient class.

Normalized ingredients

The raw ingredients must be transformed to normalized ingredients. A normalized ingredient is a single string in a western script (this assumes that a western script can be read by anyone in the world). A normalized ingredient is preceded by the language code of the ingredient. Usually this is English, if a translation is available. (The source language may be needed for better comprehension when classifying ingredients.)

Normalizing ingredients implies that any variant of an ingredient in any language is mapped onto a single name. The normalizing procedure mainly consists of pattern recognition.

For Protégé this implies that every normalized ingredient individual must have a data property that holds the data corresponding to the individual. The data property hasName valueString is used to infer the corresponding classes. It is assumed that valueString has the format languageCode:string. This data property can then be used to classify an ingredient. Other information about the product may be required for a better classification.
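As a rough illustration of the languageCode:string format assumed above, the value could be split as follows. This is only a minimal Python sketch; the names NormalizedName and parse_normalized are hypothetical and are not part of OFF or Protégé.

```python
from typing import NamedTuple


class NormalizedName(NamedTuple):
    """Hypothetical container for a 'languageCode:string' value."""
    language_code: str
    name: str


def parse_normalized(value: str) -> NormalizedName:
    """Split a 'languageCode:string' value into its two parts."""
    # Split on the first colon only, so a name containing ':' stays intact.
    code, sep, name = value.partition(":")
    if not sep or not code.strip() or not name.strip():
        raise ValueError(f"expected 'languageCode:string', got {value!r}")
    return NormalizedName(code.strip().lower(), name.strip())
```

Splitting on the first colon only is a design choice of the sketch: it keeps the language code separate without restricting which characters the ingredient name itself may contain.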

Normalisation procedure

  • plural / singular - should all ingredients be defined in singular or plural? It probably depends on what is meant (or left out). If a raw ingredient says minerals or salts, it does not say which mineral or salt, and there can be more than one. So plurals should be kept in the normalisation.

Classes

Languages

For each language a class is defined. The name of the class, English Language for instance, is in English and makes clear what it is about. The annotation data for the language can be taken from Lingvoj by hand. Each language has an equivalent-to entry, which relates the language to the corresponding languageCode. This entry has the format: hasName only xsd:string[pattern "languageCode:.*"]. Each language is a subclass of the superclass language.
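The pattern above can be imitated outside Protégé with an ordinary regular expression. The Python sketch below is only an analogy for what the reasoner does with the pattern; the function language_class and the LANGUAGE_CLASSES mapping are assumptions made for illustration.

```python
import re

# Hypothetical mapping from language class name to its languageCode.
LANGUAGE_CLASSES = {
    "English Language": "en",
    "German Language": "de",
}


def language_class(value, classes=LANGUAGE_CLASSES):
    """Return the name of the language class whose pattern matches value."""
    for class_name, code in classes.items():
        # XSD patterns must match the whole string, hence fullmatch.
        if re.fullmatch(re.escape(code) + ":.*", value):
            return class_name
    # No language class matched the value.
    return None
```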

Ingredient

An ingredient individual corresponds to a class in the ontology. A class consists of:

  • name - written as OFF_INGREDIENT. The name is in principle in English. (What to do if no translation exists?)
  • annotations:
    • label - describing the class. One for each language. The names as used by Wikipedia can be used. Often the label is similar
    • wikipedia - a link to the English page of the corresponding ingredient on Wikipedia.
    • wikidata - the link to the English Wikidata entry.
    • openFoodFacts ingredient - a world link to the corresponding ingredient page on OFF.
    • comment - a comment in English can be added to explain the decisions made.
  • equivalent class expression - this can be used to relate the class-name to the instance-name. The inference relies on this. It is not clear if this should be used to create equivalent classes.

Producers take a lot of freedom in describing the ingredients they use. This implies that an approach is needed to standardise the ingredients that are found in the ingredients list.

  • Main ingredient name - an ingredient might appear under different names, the synonyms. Of these multiple names one will be chosen as main ingredient name. The main ingredient name will be defined in a single language.
  • Synonyms - any synonym will be a separate entry
  • Translations - an ingredient can be translated into multiple languages. However, one must be careful that an ingredient is really the same in another language. Legislation or actual production processes can be different. It might be possible to alert the user to such cases. Not all ingredients will be translatable into all languages.
  • Compound ingredients - sometimes an ingredient list will contain a compound ingredient, i.e. an ingredient (product?) that consists of other ingredients.

Super classes

Are these needed? What should they mean?

Object properties

The relationships should define how the ingredients are related to each other. An ingredient can be created from another ingredient by applying some transformation process. This transformation process could remove one of the sub-ingredients, or it could change one sub-ingredient into another ingredient.

Formal relationships

  • contains - describes if an ingredient contains another ingredient. The relationship could specify the fraction of the ingredient. For example butter contains 80% butterfat, pastry butter contains 99.8% butterfat;
  • removes - this transformation process removes a sub-ingredient. For example pastry butter is created from butter by removing 20% water;
  • is melted - process whereby the ingredient is made fluid (melted butter)
  • isa - describes a detailed specification of an ingredient
  • is produced in - describes the location where the ingredient is created. This can be a geographic location (beurre d'isigny aop) or in a type of factory (beurre laitier)
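The relationships listed above could be held as simple subject/predicate/target records. The following Python sketch is one possible representation, not the structure used by OFF or OWL; the Relation class, the RELATIONS list and the contained_in helper are hypothetical names, and the fractions are the ones quoted in the list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Relation:
    """Hypothetical record for one relationship between two ingredients."""
    subject: str
    predicate: str  # e.g. "contains", "removes", "isa"
    target: str
    fraction: Optional[float] = None  # share of the subject, when known


RELATIONS = [
    Relation("en:butter", "contains", "en:butterfat", 0.80),
    Relation("en:pastry butter", "contains", "en:butterfat", 0.998),
    Relation("en:pastry butter", "removes", "en:water"),
]


def contained_in(ingredient, relations):
    """Return {sub-ingredient: fraction} for direct 'contains' relations."""
    return {
        r.target: r.fraction
        for r in relations
        if r.subject == ingredient and r.predicate == "contains"
    }
```

Keeping the fraction optional reflects that the text only says the relationship "could" specify it.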

Issues

It seems a lot of data is entered twice. I hope this will be solved as I understand things better.

Example

Maybe I can make a drawing of a part of the ontology.

Ontology Usage

The goal is to obtain an ontology that can be used by OFF to analyse the ingredients in an ingredient list. One could envisage the following steps:

  1. The ingredient instance/entry/individual is entered into the inference engine
  2. The engine infers an Ingredient Class and its subclasses
  3. The results can be presented to the user and will explain what the class is, how it is related to other Ingredient Classes, how the Ingredient Class is created from other classes, and what the ingredient entry does not tell.
  4. The user can select the language in which they want to see the results.

OFF Ingredients Taxonomy

The ontology should be usable as the translations taxonomy. This taxonomy lists all ingredients, their synonyms and their translations. This taxonomy is already in use.

Object properties

The object properties describe the relations between classes.

The following object properties have been identified:

Contains/isPartOf

These properties indicate whether an ingredient is composed of other ingredients. The relations are only defined if they clarify the definition of an ingredient class.

isDerived

This is a class of properties that describe how one ingredient class is transformed into another ingredient class.

isManufactureIn/manufactures

This describes the relation with a geographic place class.

Example

# CLARIFIED BUTTER - en:milk fat rendered from butter to separate the en:milk solids and water from the en:butterfat

<en:butter
en:clarified butter
bxr:Шара тоһон
ca:mantega clarificada
cs:přepuštěné máslo
de:Butterschmalz

Explanation

  • The # marks a comment line and can be used to add a definition of the ingredient. In this case the definition is taken from Wikipedia.
  • The <en:butter line describes the parent ingredient of this ingredient. The parent ingredient forms the basis of the current ingredient. This line is optional. It implies that any properties of the parent are also valid for the child, unless a property has been redefined.
  • The en:clarified butter line is the main name of the ingredient. Any synonyms appear after the main name, separated by commas. The prefix en: defines the language of the main ingredient.
  • The next lines provide translations of the main ingredient in other languages. One language per line. Each line starts with a language prefix. Thus de: means German.
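The line types explained above can be read with a few lines of code. The Python sketch below is a minimal reader for a single entry; it is not OFF's actual taxonomy parser, the TaxonomyEntry class and parse_entry function are hypothetical, and it does not handle other line types such as the Wikipedia: line.

```python
from dataclasses import dataclass, field


@dataclass
class TaxonomyEntry:
    """Hypothetical in-memory form of one taxonomy entry."""
    comments: list = field(default_factory=list)
    parents: list = field(default_factory=list)
    names: dict = field(default_factory=dict)  # language code -> [main, synonyms...]
    main_language: str = ""


def parse_entry(text: str) -> TaxonomyEntry:
    """Parse the lines of a single entry (one blank-line-separated block)."""
    entry = TaxonomyEntry()
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Comment line, e.g. a definition of the ingredient.
            entry.comments.append(line[1:].strip())
        elif line.startswith("<"):
            # Parent ingredient line.
            entry.parents.append(line[1:].strip())
        else:
            # Name line: language prefix, then comma-separated names.
            code, _, names = line.partition(":")
            entry.names[code] = [n.strip() for n in names.split(",") if n.strip()]
            if not entry.main_language:
                # The first name line defines the main name and its language.
                entry.main_language = code
    return entry
```

For the clarified butter entry, parse_entry would put en:butter in parents and take en as the main language, since the first name line is treated as the main name.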

Categories Taxonomy

OFF maintains a categories taxonomy to categorise products. This categories taxonomy is related to the ingredient ontology, but subtly different. First a product is something that is on sale and has an identifier (barcode). A product will never be part of an ingredients list. You will never see a barcode in the ingredients.

The relationship between product categories and ingredients can be described as: A product with a product name and barcode from a brand belongs to a category and has one or more ingredients.

For example the product Beurre Gastronomique Doux of the brand Milbona with barcode 20139315 has one ingredient: Beurre pasteurisé and belongs to the Sweet cream butters category.
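The relationship between a product, its category and its ingredients can be written down as a small data structure. This Python sketch only mirrors the sentence above; the Product class and its field names are assumptions and do not reflect the OFF database schema.

```python
from dataclasses import dataclass, field


@dataclass
class Product:
    """Hypothetical record: a product has a barcode, a brand, categories and ingredients."""
    barcode: str
    name: str
    brand: str
    categories: list = field(default_factory=list)
    ingredients: list = field(default_factory=list)


# The example product from the text.
butter = Product(
    barcode="20139315",
    name="Beurre Gastronomique Doux",
    brand="Milbona",
    categories=["Sweet cream butters"],
    ingredients=["Beurre pasteurisé"],
)
```

Note that the barcode lives on the product only; an ingredient entry carries no identifier of that kind.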

Allergens taxonomy

OFF uses an allergens taxonomy/thesaurus to detect/describe allergens. We need to indicate the relation with this ontology.

Maintenance

This section describes how the taxonomy file should be maintained, i.e. adding ingredients, editing ingredients, etc.

Automatic add

Describes how new ingredients can be added automatically from the ingredients available in the OFF database. @stephane

Add single ingredient

An ingredient is a node in the taxonomy file, which describes a single ingredient as is found in an ingredient list.

Add ingredient name

You can add an ingredient anywhere in the file at first. Keep a blank line between the previous and next ingredient. Add the ingredient in the format: LANGUAGE_CODE:MAIN_INGREDIENT_NAME, SYNONYM, SYNONYM.

Thus, start by defining the ingredient name you want to add. If multiple ingredient names are possible, decide which one will be the MAIN_INGREDIENT_NAME. The other names can be added as SYNONYM. The ingredient names are separated by commas.

Determine the language code to be used. Find your language in this lang list and add the corresponding ISO 639-1 code as LANGUAGE_CODE to your MAIN_INGREDIENT_NAME, separated by a colon (:).
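The LANGUAGE_CODE:MAIN_INGREDIENT_NAME, SYNONYM, SYNONYM format described above can be produced by a small helper. This Python sketch is an illustration only; format_ingredient_line is a hypothetical function and is not part of any OFF tooling.

```python
def format_ingredient_line(language_code, main_name, synonyms=()):
    """Build 'LANGUAGE_CODE:MAIN_INGREDIENT_NAME, SYNONYM, SYNONYM'."""
    code = language_code.strip().lower()
    main = main_name.strip()
    if not code or not main:
        raise ValueError("a language code and a main ingredient name are required")
    # The main name comes first; empty synonyms are dropped.
    names = [main] + [s.strip() for s in synonyms if s.strip()]
    return f"{code}:" + ", ".join(names)
```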

Add translations

You can add translations for an ingredient. Each translation should appear on a new line.

  • Wikipedia translations

If the ingredient exists as an article in Wikipedia, you can add the translations supplied by Wikipedia. You can use the language codes used by Wikipedia (as seen in the article URL). You should use the title of the article, as this corresponds to the Wikipedia link and might include some disambiguation.

Warning: you cannot add the three (or more) letter codes used by Wikipedia. OFF does not support those. However, you might add them as comment lines, so we have them for the future.

  • Other translations

It is possible to add other languages through an online dictionary. Linguee might help you here if you speak multiple languages. But preferably this should be checked by native speakers.

  • Untranslatable ingredients

Sometimes an item cannot be translated, as it just does not exist in that language. If possible, we could add a description instead.
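The warning about three (or more) letter codes in the Wikipedia translations step could be applied mechanically. The Python sketch below turns such translation lines into comment lines; comment_out_long_codes is a hypothetical helper, and it assumes it is given translation lines only (not parent or Wikipedia: lines).

```python
def comment_out_long_codes(lines):
    """Turn translation lines with a code longer than two letters into comments."""
    result = []
    for line in lines:
        code = line.partition(":")[0].strip()
        if len(code) > 2 and not line.lstrip().startswith("#"):
            # Keep the translation for the future, but as a comment line.
            result.append("# " + line)
        else:
            result.append(line)
    return result
```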

Sort translations

The translations should be sorted in alphabetical order of the language code.

The first line is an exception: it holds the default language, which is used when no translation is present.
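The sorting rule, including the exception for the first line, can be sketched in Python as follows. The function sort_translations is hypothetical and assumes it receives only the name lines of one entry, with the default language first.

```python
def sort_translations(lines):
    """Sort name lines by language code, keeping the first (default) line in place."""
    if not lines:
        return []
    first, rest = lines[0], lines[1:]
    # The language code is everything before the first colon.
    return [first] + sorted(rest, key=lambda line: line.partition(":")[0])
```

Because sorted is stable, two lines with the same language code keep their original relative order.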

Add corresponding wikipedia entry

On a new line you can add the Wikipedia article that corresponds to the ingredient. The format to use for this line is:
Wikipedia:https://LANGUAGE_CODE.wikipedia.org/wiki/MAIN_INGREDIENT_NAME
So check that your main ingredient name is indeed the title of a Wikipedia article. This is not always the case.
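The Wikipedia line could be generated from the language code and the article title. This Python sketch assumes that spaces in the title become underscores, as they do in Wikipedia article URLs; wikipedia_line is a hypothetical helper, and it does not check that the article actually exists.

```python
def wikipedia_line(language_code, article_title):
    """Build the 'Wikipedia:https://...' line for a taxonomy entry."""
    # Assumption: Wikipedia article URLs use underscores instead of spaces.
    slug = article_title.strip().replace(" ", "_")
    return f"Wikipedia:https://{language_code.strip()}.wikipedia.org/wiki/{slug}"
```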

Add a parent ingredient

It is possible to link ingredients based on related properties. If the statement "this ingredient is a kind of that other ingredient" holds, then you can probably assign the other ingredient as parent. First check whether the parent already exists. If not, add the parent ingredient to the ontology as well. Then add the parent to the ingredient in the format: <LANGUAGE_CODE:PARENT_INGREDIENT_NAME

Issues

Language codes

OFF seems to use the short ISO 639-1 language codes. The consequence is that not all languages that are found on Wikipedia can be implemented in OFF. For instance, the language Furlan does not have an ISO 639-1 code, but does have an ISO 639-2 code (fur).

Disambiguation

How should disambiguation be handled?

Product Data

The current taxonomy is based on the strings found in the ingredients lists. This does not include the percentages, nor the organic markup. In addition for a better class assignment, other data from the product is useful (nutritional values, production place).

Scalability

First experiments with the classification of raw individual axioms show that the reasoner breaks down. This suggests that at least the reasoner is not very scalable. I hope the solution is.