Global taxonomies: Difference between revisions

From Open Food Facts wiki
No edit summary
(42 intermediate revisions by 3 users not shown)
Line 2: Line 2:
'''Open Food Facts''' uses multiple '''taxonomies''' for its inner workings. These taxonomies run from simple word lists with their translations, to complex hierarchies with relations between entries and attributes. The taxonomies are used to analyse and structure the information found on products.
'''Open Food Facts''' uses multiple '''taxonomies''' for its inner workings. These taxonomies run from simple word lists with their translations, to complex hierarchies with relations between entries and attributes. The taxonomies are used to analyse and structure the information found on products.


This page gives an overview of the taxonomies and some guiding principles that are valid for all taxonomies.
The [[Languages taxonomy]] for instance lists the most important languages in the world, translated into these languages. The [[Ingredients taxonomy]] lists all the ingredients that OFF has discovered on food products.


== Overview ==
This page gives an overview of the taxonomies and some guiding principles that are valid for all taxonomies. For each specific taxonomy there is a specific wikipage.
The list of taxonomies in use is:
* ''Additives classes taxonomy''
* ''Brands taxonomy'' for the producers of products and their corresponding brands. This taxonomy is not in use and still under construction (5-oct-2022);
* ''Categories taxonomy'' used to put similar products together, so that they can be compared, averages calculated and analysed as a group. For instance an orange juice of a specific brand can be compared to all orange juices;
* ''Countries taxonomy'' has a list of all countries, regions, areas, etc. in the world;
* ''Ingredients taxonomy'' for ingredients found on product ingredients lists. Some specific ingredients have been put into separate taxonomies, as these need special handling due to legal requirements:
** ''Additives taxonomy''
** ''Allergens taxonomy''
** ''Amino acids taxonomy''
** ''Minerals taxonomy''
** ''Nucleotides taxonomy''
** ''Other nutritional substances taxonomy''
** ''Vitamins taxonomy''
* ''[[Labels taxonomy]]'' for every logo and claim by producer about product quality, supply chain, sourcing, diets, etc.;
* ''[[Languages taxonomy]]'' with a list of world languages in multiple languages;
* ''NOVA groups taxonomy''
* ''Packaging/recycling taxonomies'' have been spread out over several specific taxonomies:
** ''Materials taxonomy'' with a list of all possible materials that can be used;
** ''Shapes taxonomy'' which describes all the parts the packaging can consist of;
** ''Recycling taxonomy'' with all ways packaging can be recycled;
* ''States taxonomy'' which describes how complete the data of a product on OFF is;


== Presentations ==
== Presentations ==
There are several other presentations and descriptions of taxonomies that you might want to look at before diving into the details:
There are several other presentations and descriptions of taxonomies that you might want to look at before diving into the details:
* [https://docs.google.com/presentation/d/1zL2fA3d_fuPvKKKmCGJ5aV_Ug-R-gXQbwSCcxf3dbo4/edit?usp=sharing Quick presentation of the taxonomies]
* [https://docs.google.com/presentation/d/1zL2fA3d_fuPvKKKmCGJ5aV_Ug-R-gXQbwSCcxf3dbo4/edit?usp=sharing Quick presentation of the taxonomies]
* [https://www.youtube.com/watch?v=b0xRwU_De9Q Lightning talk about taxonomies by Léonore] (Open Food Facts days 2023) and [https://docs.google.com/presentation/d/1_QPKTCyDXNK6HgwHQ0W-ebYCfkB3U3NcL0xLQJJZ9TU/edit#slide=id.p the slides]
* A [https://yuktea.wordpress.com/2022/06/22/three-generations/ short intro by Yukti on her blog].
* A [https://yuktea.wordpress.com/2022/06/22/three-generations/ short intro by Yukti on her blog].


Line 43: Line 23:


=== Generic synonyms ===
=== Generic synonyms ===
What is the role of these in parsing?
These are words that can be added to multiple names but carry little meaning. This allows the system to filter them out, so that you do not have to introduce them everywhere.


=== Term Section ===
=== Terms ===
The main part of a taxonomy is a (very) large list of term sections. A term section consists of a list of terms in different languages and a list of properties.
The main part of a taxonomy is a (very) large list of terms. A term consists of a list of names in different languages, an optional parent and a list of properties.


==== Term ====
==== Name ====
A term can be a specific ingredient, label, language, etc, depending on the taxonomy.
A name can be a specific ingredient, label, language, etc, depending on the taxonomy.


A term consists of two parts: a language code (see the [[Languages taxonomy]]) and one or more values. The language code determines the language of the values that follow. Thus one can find the value '''Authentic Trappist Product''' in the English language in the [[Labels taxonomy]].
A name consists of two parts: a language code (see the [[Languages taxonomy]]) and one or more values. The language code determines the language of the values that follow. Thus one can find the value '''Authentic Trappist Product''' in the English language in the [[Labels taxonomy]].


If a term can be translated to another language, a new term can be added for that language. For instance for the English name '''Banana''' you might want to add the name in Afrikaans ('''Piesang''') or Amharic ('''ሙዝ''').
If a name can be translated to another language, a new name can be added for that language. For instance for the English name '''Banana''' you might want to add the name in Afrikaans ('''Piesang''') or Amharic ('''ሙዝ''').


==== Canonical term/value ====
==== Canonical term/value ====
Each term section identifies a primary language that is used to identify the term section. This is the canonical name for the term section. Preferably this is the English term.
Each term identifies a primary language that is used to identify the term. This is the canonical name for the term. Preferably this is the English name.


==== Synonyms ====
==== Synonyms ====
For each term it is possible to add synonyms. These will be used during ingredient recognition for instance. However only the main term name will be used. For instance the Spanish translation for '''Banana''' is '''Plátano''', and the synonym '''Banana''' has been added.
For each name it is possible to add synonyms. These will be used during ingredient recognition for instance. However only the main name will be used. For instance the Spanish translation for the English name '''Banana''' is '''Plátano''', to which the synonym '''Banana''' has been added.


Simple synonyms (simple singular) are done automatically when possible.
Simple synonyms (simple singular) are done automatically when possible.


Synonyms are recursive: if '''Yoghurt''' is a synonym of '''Yogurt''', then '''Banana yoghurt''' will automatically be added as a synonym of '''Banana yogurt'''.
Synonyms are recursive: if '''Yoghurt''' is a synonym of '''Yogurt''', then '''Banana yoghurt''' will automatically be added as a synonym of '''Banana yogurt'''.
==== Parents ====
Each term can have one (or more) parent terms. This makes it possible to define an is-a relation between terms. For instance, '''Fruit''' is the parent of '''Banana'''. The term '''Whole black olives''' in the categories taxonomy has the parent terms '''Black olives''' and '''Whole olives'''.


==== Properties ====
==== Properties ====
In addition to terms, one or more properties can be added to a term section. The following properties are supported:
In addition to names, one or more properties can be added to a term section.
* ''Description'' - a short (three lines) description of the term section;
* ''Image (logo)'' - an image or logo related to the term;
* ''Wikidata'' - a link to the term on Wikidata;
* ''Wikipedia'' - a link to the term's equivalent on Wikipedia;


== Implementation ==
For a full listing of the properties that are used, see [[Taxonomy Properties]].
Each specific taxonomy might use additional properties.


TBD: the following texts will be updated and/or placed in separate pages. @aleene 5-10-2022
=== Structure ===
The taxonomy is not a strict hierarchy: sections can have multiple parents. But cycles are not allowed. Formally it can be seen as a [https://en.wikipedia.org/wiki/Directed_acyclic_graph Directed Acyclic Graph].


For instance, in the categories taxonomy the category '''Apricots''' can have the parent '''Fresh Food''' and the parent '''Fruit'''.


==== Languages ====
== Implementation ==
Each language has a 2-letter prefix. e.g. "en" for English and "fr" for French.
Each taxonomy is implemented as a simple text file. See [[Taxonomy implementation]].


A value can be defined in another language (which becomes the canonical language), e.g. <code>fr:soupes-a-l-oignon</code> could be the canonical value for "Onion Soups" if we don't have an English translation yet.
== Maintenance ==
Anyone can contribute to the maintenance of these taxonomies. You might want to add a translation, add synonyms, new entries or parents. Have a look at [[Taxonomy Maintenance]].


New values (e.g. categories that do not exist yet) should have an English canonical value.
== Taxonomy editor ==
We have [https://ui.taxonomy.openfoodfacts.org/ a new tool to allow editing taxonomies]. The source code and bug report/feature requests are [https://github.com/openfoodfacts/taxonomy-editor located on GitHub]. We would welcome feedback on it.


Each field value can be translated to any language.
== Overview ==
The taxonomies in use are listed below (these are the files in the [https://github.com/openfoodfacts/openfoodfacts-server/tree/main/taxonomies taxonomies folder] on GitHub):
When a field value needs to be translated to a target language, if the translation does not exist yet, English is shown (or the canonical language if the English translation does not exist either).
* ''Additives classes taxonomy''
* ''Brands taxonomy'' for the producers of products and their corresponding brands. This taxonomy is not in use and still under construction (5-oct-2022);
==== Singular or plural? ====
* ''Categories taxonomy'' used to put similar products together, so that they can be compared, averages calculated and analysed as a group. For instance an orange juice of a specific brand can be compared to all orange juices;
Generally, we use the plural for categories, but some of them are in the singular. We don't use the plural form when it has a different meaning. For example, Beef and Beefs: we are talking about the meat and not the animal, so there is no "s". Conversely, "Rillettes" in French (and other languages) doesn't have a singular form.
* ''Countries taxonomy'' has a list of all countries, regions, areas, etc. in the world;
* ''Ingredients taxonomy'' for ingredients found on product ingredients lists. Some specific ingredients have been put into separate taxonomies, as these need special handling due to legal requirements:
Sometimes plural or singular depends on the language. The situation for Dutch is described in [[Dutch translation issues]].
** ''Additives taxonomy''
** ''Allergens taxonomy''
If the category is in plural form, the translations should be in the plural form.
** ''Amino acids taxonomy''
** ''[[Ingredients processing taxonomy]]''
<pre>
** ''Minerals taxonomy''
@stephane new proposal (not yet adopted, to be discussed)
** ''Nucleotides taxonomy''
In the categories taxonomy, always use the plural (en:beers, fr:bières).
** ''Other nutritional substances taxonomy''
Then add the singular (especially if it's not a simple rule like removing the final s) in a property:
** ''Vitamins taxonomy''
singular:en:beer
* ''[[Labels taxonomy]]'' for every logo and claim by producer about product quality, supply chain, sourcing, diets, etc.;
singular:fr:bière
* ''[[Languages taxonomy]]'' with a list of world languages in multiple languages;
For the ingredients taxonomy, do the reverse:
* ''NOVA groups taxonomy''
always use the singular for the main entry, and then add the plural:
* [[Origins taxonomy]]
en:tomato
* ''Packaging/recycling taxonomies'' have been spread out over several specific taxonomies:
plural:en:tomatoes
** ''Materials taxonomy'' with a list of all possible materials that can be used;
</pre>
** ''Shapes taxonomy'' which describes all the parts the packaging can consist of;
** ''Recycling taxonomy'' with all ways packaging can be recycled;
=== Wikidata ===
* ''States taxonomy'' which describes how complete the data of a product on OFF is;
The <code>wikidata:</code> prefix links the term to its equivalent in the Wikidata database. Example:
fr:Label Rouge
xx:Label Rouge
wikidata:en:Q3214309
The description is reused on the Open Food Facts website. Example: https://world.openfoodfacts.org/label/label-rouge
=== Opposite ===
We use "opposite" for imports (e.g. when there is a column "Organic" with values like "No"):

en:Non-fair trade, Not fair trade

fr:Non issu du commerce équitable

opposite:en: en:fair-trade

=== Unused properties ===
Some people have added more properties to the different taxonomies. They are not used for the moment, but they can lead to interesting usages:

* give more information to the contributor who is editing the taxonomy
* prepare new usages or features

Here is a list of such properties already present in the taxonomies:

* wikipedia:
* country:
* label_categories:
* eu_groups:
* auth_name:
* auth_address:
* auth_url:
* exceptions:
=== Taxonomy architecture ===
The taxonomy is not a strict hierarchy: values can have multiple parents. But cycles are not allowed.
=== Finding new opportunities for taxonomization ===
* A good trick to find candidates for taxonomization is to browse these lists:
** https://world.openfoodfacts.org/categories?filter=-
** https://world.openfoodfacts.org/labels?filter=-
** https://world.openfoodfacts.org/origins?filter=-
** https://world.openfoodfacts.org/ingredients?filter=-
* Everything in italics is up for grabs

== Format ==
<pre>
# stopwords
stopwords:en: some,stopwords
stopwords:fr: word,that,are,removed,when,matching

# synonyms that are not field values but that are contained in field values
synonyms:en: global,international

en: value, a synonym value, another synonym value
fr: valeur, une valeur synonyme, une autre valeur synonyme

<en: value
en: a child value, a synonym for a child value
fr: une valeur enfant, un synonyme d'une valeur enfant

<en: value
en: another child value

<en: a child value
<en: another child value
en: a grand-child value

# properties
en: value
fr: valeur
description:en: a property of value
description:fr: French version of the property
country_code:en: a property that is the same for all languages -> use English suffix en:
wikidata:en:Q89

</pre>
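The layout above can be read with a few lines of Python. This is only a rough sketch, assuming the conventions visible in the example (entries separated by blank lines, <code><</code> for parent lines, a two-letter language prefix for name lines, <code>property:lang:</code> for property lines, <code>#</code> for comments). The function name <code>parse_taxonomy</code> and the dictionary keys are hypothetical, and this is not the parser used by the Open Food Facts server.

```python
import re

# Hypothetical helper: a rough reader for the text format sketched above.
LANG_LINE = re.compile(r"^([a-z]{2}):\s*(.+)$")              # e.g. "en: value, synonym"
PROP_LINE = re.compile(r"^([a-z_]+):([a-z]{2}):\s*(.+)$")    # e.g. "wikidata:en:Q89"


def parse_taxonomy(text):
    """Return a list of entries; each entry has parents, names and properties."""
    entries = []
    for block in re.split(r"\n\s*\n", text.strip()):
        entry = {"parents": [], "names": {}, "properties": {}}
        for raw in block.splitlines():
            line = raw.strip()
            if not line or line.startswith("#"):
                continue  # skip empty lines and comments
            if line.startswith(("stopwords:", "synonyms:")):
                continue  # global lines are not handled in this sketch
            if line.startswith("<"):
                entry["parents"].append(line[1:].strip())
                continue
            match = LANG_LINE.match(line)
            if match:
                lang, values = match.groups()
                entry["names"][lang] = [v.strip() for v in values.split(",")]
                continue
            match = PROP_LINE.match(line)
            if match:
                prop, lang, value = match.groups()
                entry["properties"][(prop, lang)] = value.strip()
        if entry["names"]:
            entries.append(entry)
    return entries


sample = """\
en: value, a synonym value
fr: valeur

<en: value
en: a child value
wikidata:en:Q89
"""

entries = parse_taxonomy(sample)
```

In this sketch the first value on a name line is treated as the main name and the remaining values as synonyms, and parents are kept as the raw text that follows <code><</code>.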
== Getting taxonomies files ==

The definitions can be edited on [https://github.com/openfoodfacts/openfoodfacts-server/tree/master/taxonomies GitHub]; they are periodically synchronized with the Open Food Facts database and website.

=== Raw taxonomies ===
'''(on GitHub; account and VCS knowledge needed)'''
* [https://github.com/openfoodfacts/openfoodfacts-server/blob/master/taxonomies/test.txt Test taxonomy] showing the basic taxonomy definition features
* [https://github.com/openfoodfacts/openfoodfacts-server/blob/master/taxonomies/ingredients.txt Ingredients taxonomy]
* [https://github.com/openfoodfacts/openfoodfacts-server/blob/master/taxonomies/categories.txt Global categories taxonomy]
* [[Global brands and companies taxonomy]]
* [https://github.com/openfoodfacts/openfoodfacts-server/blob/master/taxonomies/labels.txt Global labels taxonomy]
* [[Global labels taxonomy logos]]
* [https://github.com/openfoodfacts/openfoodfacts-server/blob/master/taxonomies/languages.txt Global languages taxonomy]
* [https://github.com/openfoodfacts/openfoodfacts-server/blob/master/taxonomies/countries.txt Global countries taxonomy]
* [[Global origins taxonomy]]
* [https://github.com/openfoodfacts/openfoodfacts-server/blob/master/taxonomies/additives.txt Global additives taxonomy]
* [https://github.com/openfoodfacts/openfoodfacts-server/blob/master/taxonomies/additives_classes.txt Global additives classes taxonomy]
* [https://github.com/openfoodfacts/openfoodfacts-server/blob/master/taxonomies/vitamins.txt Global vitamins taxonomy]
* [https://github.com/openfoodfacts/openfoodfacts-server/blob/master/taxonomies/minerals.txt Global minerals taxonomy]
* [https://github.com/openfoodfacts/openfoodfacts-server/blob/master/taxonomies/amino_acids.txt Global amino acids taxonomy]
* [https://github.com/openfoodfacts/openfoodfacts-server/blob/master/taxonomies/nucleotides.txt Global nucleotides taxonomy]
* [https://github.com/openfoodfacts/openfoodfacts-server/blob/master/taxonomies/other_nutritional_substances.txt Global other nutritional substances taxonomy]
* [https://github.com/openfoodfacts/openfoodfacts-server/blob/master/taxonomies/allergens.txt Global allergens taxonomy]
* [https://github.com/openfoodfacts/openfoodfacts-server/blob/master/taxonomies/states.txt Global states taxonomy]
* [https://github.com/openfoodfacts/openfoodfacts-server/blob/master/taxonomies/nova_groups.txt Global NOVA groups taxonomy]
=== Taxonomies in JSON format ===
Every taxonomy is also available as JSON on Open Food Facts. Use
* <code><nowiki>https://static.openfoodfacts.org/data/taxonomies/</nowiki><taxonomy name>.json</code>
* or <code><nowiki>https://static.openfoodfacts.org/data/taxonomies/</nowiki><taxonomy name>.full.json</code> to get additional properties

Example for categories:
* https://static.openfoodfacts.org/data/taxonomies/categories.json
* https://static.openfoodfacts.org/data/taxonomies/categories-full.json

=== Taxonomy API ===
We have a basic taxonomy API to get information about a taxonomy.

E.g. to get information about the ''en:carrots'' entry in the ''categories'' taxonomy:

https://world.openfoodfacts.org/api/v2/taxonomy?tagtype=categories&tags=en:carrots

You can add the ''lc'' (language code) parameter to ask for more than one language. ''cc'' is used for the country code.

E.g. to add French information to the previous request:

[https://world.openfoodfacts.org/api/v2/taxonomy?tagtype=categories&tags=en:carrots&lc=en%2Cfr&cc=fr https://world.openfoodfacts.org/api/v2/taxonomy?tagtype=categories&tags=en:carrots&lc=en,fr&cc=fr]
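The request URLs above can also be assembled programmatically. The following is a small sketch that only builds the URL from the parameters mentioned in this section (<code>tagtype</code>, <code>tags</code>, <code>lc</code>, <code>cc</code>); the helper name <code>taxonomy_api_url</code> is hypothetical, and the sketch does not send the request or make any claim about the shape of the response.

```python
from urllib.parse import urlencode

API_BASE = "https://world.openfoodfacts.org/api/v2/taxonomy"


def taxonomy_api_url(tagtype, tag, lc=None, cc=None):
    """Build a taxonomy API URL; lc is a list of language codes, cc a country code."""
    params = {"tagtype": tagtype, "tags": tag}
    if lc:
        params["lc"] = ",".join(lc)
    if cc:
        params["cc"] = cc
    # Keep ':' readable (en:carrots); the comma in lc is percent-encoded as %2C.
    return API_BASE + "?" + urlencode(params, safe=":")


url = taxonomy_api_url("categories", "en:carrots", lc=["en", "fr"], cc="fr")
print(url)
```

The resulting string matches the encoded form of the second example link above.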


== Ideas ==
If you have any ideas for taxonomies, contact us on Slack and describe a proposal on this wiki.
=== Draft Taxonomies ===
=== Draft Taxonomies ===
* [[Global packaging taxonomy]]
There are some ideas for creating other taxonomies:
* [[Global stores taxonomy]]
* [[Global stores taxonomy]]
* [[Global Religious Certification taxonomy]]
* [[Global Religious Certification taxonomy]]
Line 240: Line 103:
* [[Global IGP taxonomy]]
* [[Global IGP taxonomy]]
* [[Global EC marks taxonomy]]
* [[Global EC marks taxonomy]]
* [[Diets taxonomy]]


== Building and deploying taxonomies ==
== Access ==
The taxonomies can be viewed on Github, via the products on the website and with an API. This is detailed on a separate page:
[[Taxonomy access]].
[[Category:Taxonomies]]


Changes to taxonomies on GitHub are not deployed instantly: they need to be built and deployed, and products need to be re-processed with the new taxonomy.


* [[How to build and deploy taxonomies]]
== Get in touch ==
{{Box
== More info ==
| 1    =  Slack channel
For detailed information specific for the ingredients taxonomy see [[Project:Ingredients ontology|Ingredients Ontology]].
| 2    =  [https://openfoodfacts.slack.com/messages/C02VDSWHT/ #taxonomies]
[[Category:Taxonomies]]
}}

Latest revision as of 08:46, 8 August 2024

Introduction

Open Food Facts uses multiple taxonomies for its inner workings. These taxonomies run from simple word lists with their translations, to complex hierarchies with relations between entries and attributes. The taxonomies are used to analyse and structure the information found on products.

The Languages taxonomy for instance lists the most important languages in the world, translated into these languages. The Ingredients taxonomy lists all the ingredients that OFF has discovered on food products.

This page gives an overview of the taxonomies and some guiding principles that are valid for all taxonomies. For each specific taxonomy there is a specific wikipage.

Presentations

There are several other presentations and descriptions of taxonomies that you might want to look at before diving into the details:

Principles

There are some general principles that are valid for all taxonomies. A taxonomy consists of terms and stopwords.

Stopwords

Stopwords are words that can be ignored. These words can be found among ingredients, for instance, but are not ingredients themselves. Each language can have its own stopwords.

In the ingredients taxonomy, for instance, "contains" is not an ingredient.

Stopwords can be used to further extend synonyms, e.g. if "à" and "la" are stopwords for French, then "Yaourts fraise" will automatically be mapped to "Yaourts à la fraise".
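The mapping in this example can be illustrated with a short Python sketch. It assumes that matching works by comparing names after stopwords have been dropped; the names STOPWORDS and normalize are hypothetical, and the actual matching in Open Food Facts may differ.

```python
# Hypothetical stopword list for French, taken from the example above.
STOPWORDS = {"fr": {"à", "la"}}


def normalize(name, lang):
    """Lowercase a name and drop the stopwords of the given language."""
    words = name.lower().split()
    return " ".join(w for w in words if w not in STOPWORDS.get(lang, set()))


# Index the known taxonomy name by its normalized form.
known = {normalize("Yaourts à la fraise", "fr"): "Yaourts à la fraise"}

# A name without the stopwords normalizes to the same key.
match = known.get(normalize("Yaourts fraise", "fr"))
print(match)
```

Because both spellings reduce to the same normalized key, the shorter form finds the existing entry without being listed as an explicit synonym.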

Generic synonyms

These are words that can be added to multiple names but carry little meaning. This allows the system to filter them out, so that you do not have to introduce them everywhere.

Terms

The main part of a taxonomy is a (very) large list of terms. A term consists of a list of names in different languages, an optional parent and a list of properties.

Name

A name can be a specific ingredient, label, language, etc, depending on the taxonomy.

A name consists of two parts: a language code (see the Languages taxonomy) and one or more values. The language code determines the language of the values that follow. Thus one can find the value Authentic Trappist Product in the English language in the Labels taxonomy.

If a name can be translated to another language, a new name can be added for that language. For instance for the English name Banana you might want to add the name in Afrikaans (Piesang) or Amharic (ሙዝ).

Canonical term/value

Each term identifies a primary language that is used to identify the term. This is the canonical name for the term. Preferably this is the English name.

Synonyms

For each name it is possible to add synonyms. These will be used during ingredient recognition for instance. However only the main name will be used. For instance the Spanish translation for the English name Banana is Plátano, to which the synonym Banana has been added.

Simple synonyms (simple singular) are done automatically when possible.

Synonyms are recursive: if Yoghurt is a synonym of Yogurt, then Banana yoghurt will automatically be added as a synonym of Banana yogurt.
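A possible way to picture this expansion is sketched below in Python. It assumes that synonyms are applied word by word inside longer names; the table WORD_SYNONYMS and the function expand_synonyms are hypothetical and only illustrate the idea, not the actual implementation.

```python
# Hypothetical synonym table: canonical word -> alternative spellings.
WORD_SYNONYMS = {"yogurt": ["yoghurt"]}


def expand_synonyms(name):
    """Return the name variants obtained by swapping in word-level synonyms."""
    variants = {name.lower()}
    for canonical, alternatives in WORD_SYNONYMS.items():
        for variant in list(variants):
            words = variant.split()
            if canonical in words:
                for alt in alternatives:
                    variants.add(" ".join(alt if w == canonical else w for w in words))
    # Only report the derived variants, not the name itself.
    variants.discard(name.lower())
    return sorted(variants)


print(expand_synonyms("Banana yogurt"))
```

A name that contains no word from the table, such as "Banana", yields no extra variants.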

Parents

Each term can have one (or more) parent terms. This makes it possible to define an is-a relation between terms. For instance, Fruit is the parent of Banana. The term Whole black olives in the categories taxonomy has the parent terms Black olives and Whole olives.

Properties

In addition to names, one or more properties can be added to a term section.

For a full listing of the properties that are used, see Taxonomy Properties. Each specific taxonomy might use additional properties.

Structure

The taxonomy is not a strict hierarchy: sections can have multiple parents. But cycles are not allowed. Formally it can be seen as a Directed Acyclic Graph.

For instance, in the categories taxonomy the category Apricots can have the parent Fresh Food and the parent Fruit.
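The two rules (multiple parents allowed, cycles forbidden) can be sketched in Python as follows. The parent table is hypothetical: it reuses the examples from this page, and the Olives parent is added purely for illustration. The function ancestors is likewise a hypothetical helper, not part of any Open Food Facts code.

```python
# Hypothetical parent table modelled on the examples in the text.
PARENTS = {
    "Apricots": ["Fresh Food", "Fruit"],
    "Whole black olives": ["Black olives", "Whole olives"],
    "Black olives": ["Olives"],  # assumed parent, for illustration only
    "Whole olives": ["Olives"],  # assumed parent, for illustration only
}


def ancestors(term, parents, _path=()):
    """Collect all ancestors of a term; raise ValueError if a cycle is found."""
    if term in _path:
        raise ValueError("cycle detected: " + " -> ".join(_path + (term,)))
    found = set()
    for parent in parents.get(term, []):
        found.add(parent)
        found |= ancestors(parent, parents, _path + (term,))
    return found


print(sorted(ancestors("Whole black olives", PARENTS)))
```

A term reached through two different parents is simply collected once, which is what distinguishes a directed acyclic graph from a strict tree.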

Implementation

Each taxonomy is implemented as a simple text file. See Taxonomy implementation.

Maintenance

Anyone can contribute to the maintenance of these taxonomies. You might want to add a translation, add synonyms, new entries or parents. Have a look at Taxonomy Maintenance.

Taxonomy editor

We have a new tool to allow editing taxonomies. The source code and bug report/feature requests are located on GitHub. We would welcome feedback on it.

Overview

The taxonomies in use are listed below (these are the files in the taxonomies folder on GitHub):

  • Additives classes taxonomy
  • Brands taxonomy for the producers of products and their corresponding brands. This taxonomy is not in use and still under construction (5-oct-2022);
  • Categories taxonomy used to put similar products together, so that they can be compared, averages calculated and analysed as a group. For instance an orange juice of a specific brand can be compared to all orange juices;
  • Countries taxonomy has a list of all countries, regions, areas, etc. in the world;
  • Ingredients taxonomy for ingredients found on product ingredients lists. Some specific ingredients have been put into separate taxonomies, as these need special handling due to legal requirements:
    • Additives taxonomy
    • Allergens taxonomy
    • Amino acids taxonomy
    • Ingredients processing taxonomy
    • Minerals taxonomy
    • Nucleotides taxonomy
    • Other nutritional substances taxonomy
    • Vitamins taxonomy
  • Labels taxonomy for every logo and claim by producer about product quality, supply chain, sourcing, diets, etc.;
  • Languages taxonomy with a list of world languages in multiple languages;
  • NOVA groups taxonomy
  • Origins taxonomy
  • Packaging/recycling taxonomies have been spread out over several specific taxonomies:
    • Materials taxonomy with a list of all possible materials that can be used;
    • Shapes taxonomy which describes all the parts the packaging can consist of;
    • Recycling taxonomy with all ways packaging can be recycled;
  • States taxonomy which describes how complete the data of a product on OFF is;

Ideas

If you have any ideas for taxonomies, contact us on Slack and describe a proposal on this wiki.

Draft Taxonomies

There are some ideas for creating other taxonomies:

Access

The taxonomies can be viewed on Github, via the products on the website and with an API. This is detailed on a separate page: Taxonomy access.


Get in touch

Slack channel