Global taxonomies
Revision as of 14:24, 2 August 2024
Introduction
Open Food Facts uses multiple taxonomies for its inner workings. These taxonomies range from simple word lists with their translations to complex hierarchies with relations between entries and attributes. The taxonomies are used to analyse and structure the information found on products.
The Languages taxonomy for instance lists the most important languages in the world, translated into these languages. The Ingredients taxonomy lists all the ingredients that OFF has discovered on food products.
This page gives an overview of the taxonomies and some guiding principles that are valid for all taxonomies. For each specific taxonomy there is a specific wikipage.
Presentations
There are several other presentations and descriptions of taxonomies, which you might want to look at before diving into the details:
- Quick presentation of the taxonomies
- Lightning talk about taxonomies by Léonore (Open Food Facts days 2023) and the slides
- A short intro by Yukti on her blog.
Principles
There are some general principles that are valid for all taxonomies. A taxonomy consists of terms and stopwords.
Stopwords
Stopwords are words that can be ignored. These words can be found among ingredients, for instance, but are not ingredients themselves. Each language can have its own stopwords.
In the ingredients taxonomy, for instance, "contains" is not an ingredient.
Stopwords can be used to further extend synonyms. For example, if "à" and "la" are stopwords for French, then "Yaourts fraise" will automatically be mapped to "Yaourts à la fraise".
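The stopword behaviour described above can be sketched in a few lines of Python. This is only an illustration, not the matching code used by Open Food Facts: the names FRENCH_STOPWORDS, normalize and same_entry are hypothetical, and the sketch assumes that names are compared word by word after lowercasing.

```python
# Hypothetical stopword list for French; the real lists live in the taxonomy files.
FRENCH_STOPWORDS = {"à", "la"}


def normalize(name: str, stopwords: set[str]) -> str:
    """Lowercase a name and drop stopwords so that variants compare equal."""
    words = name.lower().split()
    return " ".join(word for word in words if word not in stopwords)


def same_entry(a: str, b: str, stopwords: set[str]) -> bool:
    """Return True when two names are identical once stopwords are ignored."""
    return normalize(a, stopwords) == normalize(b, stopwords)
```

With this sketch, same_entry("Yaourts fraise", "Yaourts à la fraise", FRENCH_STOPWORDS) evaluates to True, because both names reduce to the same words once the stopwords are removed.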
Generic synonyms
These are words that can be added to multiple names but carry little meaning. Declaring them allows the system to filter them out, so that you do not have to introduce them everywhere.
Terms
The main part of a taxonomy is a (very) large list of terms. A term consists of a list of names in different languages, one or more optional parents and a list of properties.
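As a rough mental model, a term can be pictured as a small record. The Python sketch below is illustrative only: the Term class, its field names and the "en:fruit" identifier format are assumptions made for this example, not the data structure used by Open Food Facts.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Term:
    """Illustrative model of one taxonomy term (not the actual OFF structure)."""

    # Language code -> list of values; the first value is treated as the main name.
    names: dict[str, list[str]]
    # Identifiers of the parent terms (empty for a term without parents).
    parents: list[str] = field(default_factory=list)
    # Free-form properties attached to the term.
    properties: dict[str, str] = field(default_factory=dict)

    def main_name(self, lang: str) -> Optional[str]:
        """Return the first value listed for a language, or None if there is none."""
        values = self.names.get(lang)
        return values[0] if values else None


# Example term; the parent identifier "en:fruit" is a hypothetical format.
banana = Term(
    names={"en": ["Banana"], "es": ["Plátano", "Banana"], "af": ["Piesang"]},
    parents=["en:fruit"],
)
```

Keeping the values of each language in an ordered list makes it easy to distinguish the main name (the first value) from its synonyms.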
Name
A name can be a specific ingredient, label, language, etc., depending on the taxonomy.
A name consists of two parts: a language code (see the Languages taxonomy) and one or more values. The language code determines the language of the values that follow. Thus one can find the value Authentic Trappist Product in the English language in the Labels taxonomy.
If a name can be translated to another language, a new name can be added for that language. For instance for the English name Banana you might want to add the name in Afrikaans (Piesang) or Amharic (ሙዝ).
Canonical term/value
Each term has a primary language, and the name in that language is used to identify the term. This is the canonical name for the term. Preferably this is the English name.
Synonyms
For each name it is possible to add synonyms. These will be used during ingredient recognition, for instance. However, only the main name will be used. For instance, the Spanish translation of the English name Banana is Plátano, to which the synonym Banana has been added.
Simple synonyms (such as simple singular forms) are generated automatically when possible.
Synonyms are recursive: if Yoghurt is a synonym of Yogurt, then Banana yoghurt will automatically be added as a synonym of Banana yogurt.
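The recursive behaviour can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions, not the algorithm used by Open Food Facts: the function expand_synonyms and the constant SYNONYM_GROUPS are hypothetical, and only single-word synonyms are handled.

```python
def expand_synonyms(name: str, synonym_groups: list[set[str]]) -> set[str]:
    """Return every variant of a name obtained by swapping words for synonyms."""
    start = name.lower()
    variants = {start}
    queue = [start]
    while queue:
        words = queue.pop().split()
        for i, word in enumerate(words):
            for group in synonym_groups:
                if word not in group:
                    continue
                # Replace the word by each of its synonyms in turn.
                for alternative in group - {word}:
                    candidate = " ".join(words[:i] + [alternative] + words[i + 1:])
                    if candidate not in variants:
                        variants.add(candidate)
                        queue.append(candidate)
    return variants


# Hypothetical synonym group, taken from the example in the text.
SYNONYM_GROUPS = [{"yogurt", "yoghurt"}]
```

Calling expand_synonyms("Banana yogurt", SYNONYM_GROUPS) yields both "banana yogurt" and "banana yoghurt", so neither spelling has to be entered by hand for every compound name.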
Parents
Each term can have one or more parent terms. This makes it possible to define an is-a relation between terms. For instance, Fruit is the parent of Banana. The term Whole black olives in the categories taxonomy has the parent terms Black olives and Whole olives.
Properties
In addition to names, one or more properties can be added to a term section.
For a full listing of the properties that are used, see Taxonomy Properties. Each specific taxonomy might use additional properties.
Structure
The taxonomy is not a strict hierarchy: terms can have multiple parents, but cycles are not allowed. Formally it can be seen as a Directed Acyclic Graph.
For instance, in the categories taxonomy the category Apricots can have the parent Fresh Food and the parent Fruit.
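The graph structure can be made concrete with a small Python sketch that stores the parents of each term and checks that no term ends up being its own ancestor. The functions ancestors and has_cycle are hypothetical helpers written for this page, and the Olives parent in the sample data is an assumption added for illustration.

```python
def ancestors(term: str, parents: dict[str, list[str]]) -> set[str]:
    """Collect all direct and indirect parents of a term."""
    found: set[str] = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, []))
    return found


def has_cycle(parents: dict[str, list[str]]) -> bool:
    """Return True if any term is its own ancestor, which a DAG forbids."""
    return any(term in ancestors(term, parents) for term in parents)


# Sample data; "Olives" as a shared grandparent is a hypothetical addition.
PARENTS = {
    "Apricots": ["Fresh Food", "Fruit"],
    "Whole black olives": ["Black olives", "Whole olives"],
    "Black olives": ["Olives"],
    "Whole olives": ["Olives"],
}
```

Here has_cycle(PARENTS) is False, while a mapping such as {"a": ["b"], "b": ["a"]} would be rejected because each term is reachable from itself.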
Implementation
Each taxonomy is implemented as a simple text file. See Taxonomy implementation.
Maintenance
Anyone can contribute to the maintenance of these taxonomies. You might want to add a translation, add synonyms, new entries or parents. Have a look at Taxonomy Maintenance.
Taxonomy editor
We have a new tool for editing taxonomies. The source code, bug reports and feature requests are located on GitHub. We would welcome feedback on it.
Overview
The taxonomies currently in use are listed below (these are the files in the taxonomies folder on GitHub):
- Additives classes taxonomy
- Brands taxonomy for the producers of products and their corresponding brands. This taxonomy is not in use and is still under construction (5-oct-2022);
- Categories taxonomy used to put similar products together, so that they can be compared, averages calculated and analysed as a group. For instance an orange juice of a specific brand can be compared to all orange juices;
- Countries taxonomy has a list of all countries, regions, areas, etc. in the world;
- Ingredients taxonomy for ingredients found on product ingredients lists. Some specific ingredients have been put into separate taxonomies, as these need special handling due to legal requirements:
  - Additives taxonomy
  - Allergens taxonomy
  - Amino acids taxonomy
  - Ingredients processing taxonomy
  - Minerals taxonomy
  - Nucleotides taxonomy
  - Other nutritional substances taxonomy
  - Vitamins taxonomy
- Labels taxonomy for every logo and claim made by producers about product quality, supply chain, sourcing, diets, etc.;
- Languages taxonomy with a list of world languages in multiple languages;
- NOVA groups taxonomy
- Origins taxonomy
- Packaging/recycling taxonomies have been spread out over several specific taxonomies:
  - Materials taxonomy with a list of all possible materials that can be used;
  - Shapes taxonomy, which describes all the parts that packaging can consist of;
  - Recycling taxonomy with all the ways packaging can be recycled;
- States taxonomy, which describes how complete the data of a product on OFF is.
Ideas
If you have any ideas for taxonomies, contact us on Slack and describe a proposal on this wiki.
Draft Taxonomies
There are some ideas for creating other taxonomies:
- Global stores taxonomy
- Global Religious Certification taxonomy
- Global Food Preparation taxonomy (related to Project:Microwave)
- Global IGP taxonomy
- Global EC marks taxonomy
- Diets taxonomy
Access
The taxonomies can be viewed on GitHub, via the products on the website and with an API. This is detailed on a separate page: Taxonomy access.