Global taxonomies
Revision as of 14:39, 6 October 2022
Introduction
Open Food Facts uses multiple taxonomies for its inner workings. These taxonomies range from simple word lists with their translations to complex hierarchies with relations between entries and attributes. The taxonomies are used to analyse and structure the information found on products.
This page gives an overview of the taxonomies and some guiding principles that are valid for all taxonomies.
Overview
The list of taxonomies in use is:
- Additives classes taxonomy
- Brands taxonomy for the producers of products and their corresponding brands. This taxonomy is not yet in use and is still under construction (5-oct-2022);
- Categories taxonomy used to put similar products together, so that they can be compared, averages calculated and analysed as a group. For instance, an orange juice of a specific brand can be compared to all orange juices;
- Countries taxonomy with a list of all countries, regions, areas, etc. in the world;
- Ingredients taxonomy for ingredients found on product ingredients lists. Some specific ingredients have been put into separate taxonomies, as these need special handling due to legal requirements:
  - Additives taxonomy
  - Allergens taxonomy
  - Amino acids taxonomy
  - Minerals taxonomy
  - Nucleotides taxonomy
  - Other nutritional substances taxonomy
  - Vitamins taxonomy
- Labels taxonomy for every logo and claim made by producers about product quality, supply chain, sourcing, diets, etc.;
- Languages taxonomy with a list of world languages in multiple languages;
- NOVA groups taxonomy
- Packaging/recycling taxonomies have been spread out over several specific taxonomies:
  - Materials taxonomy with a list of all possible materials that can be used;
  - Shapes taxonomy which describes all the parts a packaging can consist of;
  - Recycling taxonomy with all the ways packaging can be recycled;
- States taxonomy which describes how complete the data of a product on OFF is.
Presentations
There are several other presentations and descriptions of taxonomies, which you might want to look at before diving into the details:
Principles
There are some general principles that are valid for all taxonomies. A taxonomy consists of terms and stopwords.
Stopwords
Stopwords are words that can be ignored. These words can be found among ingredients, for instance, but are not ingredients themselves. Each language can have its own stopwords.
In the ingredients taxonomy, for instance, "contains" is not an ingredient.
Stopwords can be used to further extend synonyms. For example, if "à" and "la" are stopwords for French, then "Yaourts fraise" will automatically be mapped to "Yaourts à la fraise".
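The stopword behaviour described above can be sketched as follows. This is a minimal illustration, not the Open Food Facts implementation: the function name `normalize` and the whitespace-based tokenization are assumptions made for the example.

```python
def normalize(name: str, stopwords: set[str]) -> str:
    """Lowercase a name and drop stopwords so that variants compare equal."""
    words = name.lower().split()
    return " ".join(w for w in words if w not in stopwords)


# French stopwords taken from the example in the text
FR_STOPWORDS = {"à", "la"}

short_form = normalize("Yaourts fraise", FR_STOPWORDS)
long_form = normalize("Yaourts à la fraise", FR_STOPWORDS)

# Both names reduce to the same key, so they can be matched to one entry
print(short_form == long_form)
```

Comparing normalized keys rather than raw names is one simple way to get the mapping described above without listing every variant explicitly.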
Generic synonyms
What is the role of these in parsing?
Terms
The main part of a taxonomy is a (very) large list of terms. A term consists of a list of names in different languages, an optional parent and a list of properties.
Name
A name can be a specific ingredient, label, language, etc., depending on the taxonomy.
A name consists of two parts: a language code (see the Languages taxonomy) and one or more values. The language code determines the language of the values that follow. Thus one can find the value Authentic Trappist Product in the English language in the Labels taxonomy.
If a name can be translated to another language, a new name can be added for that language. For instance for the English name Banana you might want to add the name in Afrikaans (Piesang) or Amharic (ሙዝ).
Canonical term/value
Each term designates a primary language whose name is used to identify the term. This is the canonical name for the term. Preferably, this is the English name.
Synonyms
For each name it is possible to add synonyms. These will be used during ingredient recognition, for instance. However, only the main name will be used. For instance, the Spanish translation of the English name Banana is Plátano, to which the synonym Banana has been added.
Simple synonyms (such as simple singular forms) are generated automatically when possible.
Synonyms are recursive: if Yoghurt is a synonym of Yogurt, then Banana yoghurt will automatically be added as a synonym of Banana yogurt.
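The recursive behaviour can be illustrated with a small sketch. It assumes that names are split into words on whitespace and that word-level synonyms are held in a dictionary; `expand_synonyms` and `WORD_SYNONYMS` are hypothetical names, and the real implementation may work differently.

```python
def expand_synonyms(name: str, word_synonyms: dict[str, list[str]]) -> set[str]:
    """Return variants of `name` obtained by swapping words for their synonyms."""
    variants = {name}
    for word, alternatives in word_synonyms.items():
        # Iterate over a copy, because new variants are added while looping
        for variant in list(variants):
            tokens = variant.split()
            if word in tokens:
                for alt in alternatives:
                    variants.add(" ".join(alt if t == word else t for t in tokens))
    return variants


# "yoghurt" is declared as a synonym of "yogurt", as in the text
WORD_SYNONYMS = {"yogurt": ["yoghurt"]}

print(sorted(expand_synonyms("banana yogurt", WORD_SYNONYMS)))
# ['banana yoghurt', 'banana yogurt']
```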
Parents
Each term can have one (or more) parent terms. This makes it possible to define an is-a relation between terms. For instance, Fruit is the parent of Banana. The term Whole black olives in the categories taxonomy has the parent terms Black olives and Whole olives.
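Because a term can have several parents, finding everything a term "is a" kind of means walking all parent links upward. The sketch below shows one way to do this. The `PARENTS` mapping is illustrative: the entries for Banana and Whole black olives come from the text, while the Olives parent is an assumption added to show more than one level.

```python
def ancestors(term: str, parents: dict[str, list[str]]) -> set[str]:
    """Collect every term reachable by following parent links upward."""
    found: set[str] = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, []))
    return found


PARENTS = {
    "Banana": ["Fruit"],
    "Whole black olives": ["Black olives", "Whole olives"],
    # Assumed for illustration only; not stated in the text
    "Black olives": ["Olives"],
    "Whole olives": ["Olives"],
}

print(sorted(ancestors("Whole black olives", PARENTS)))
# ['Black olives', 'Olives', 'Whole olives']
```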
Properties
In addition to names, one or more properties can be added to a term section. The following properties are supported:
- Description - a short (three lines) description of the term section;
- Image (logo) - an image or logo related to the term;
- Opposite - to specify the opposite relation to another term, e.g. organic versus non-organic;
- Wikidata - a link to the term on Wikidata;
- Wikipedia - a link from the term to its equivalent on Wikipedia.
Each specific taxonomy might use additional properties.
Structure
The taxonomy is not a strict hierarchy: values can have multiple parents. However, cycles are not allowed.
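The "no cycles" rule can be checked mechanically. The following is a minimal sketch using a depth-first search over parent links; the function name `has_cycle` and the dictionary representation are assumptions for the example, not the checks actually run by Open Food Facts.

```python
def has_cycle(parents: dict[str, list[str]]) -> bool:
    """Depth-first search with three states; reaching a term that is still
    being visited means the parent links loop back on themselves."""
    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state: dict[str, int] = {}

    def visit(term: str) -> bool:
        state[term] = IN_PROGRESS
        for parent in parents.get(term, []):
            parent_state = state.get(parent, UNVISITED)
            if parent_state == IN_PROGRESS:
                return True
            if parent_state == UNVISITED and visit(parent):
                return True
        state[term] = DONE
        return False

    return any(state.get(t, UNVISITED) == UNVISITED and visit(t) for t in parents)


# Multiple parents are fine as long as no term is its own ancestor
valid = {"Whole black olives": ["Black olives", "Whole olives"]}
broken = {"a": ["b"], "b": ["a"]}

print(has_cycle(valid), has_cycle(broken))
# False True
```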
Implementation
Each taxonomy is implemented as a simple text file.
Formatting
An example taxonomy file is presented below to illustrate the formatting.
# stopwords
stopwords:en: some,stopwords
stopwords:fr: word,that,are,removed,when,matching

# synonyms that are not field values but that are contained in field values
synonyms:en: global, international

en: value, a synonym value, another synonym value
fr: valeur, une valeur synonyme, une autre valeur synonyme

<en: value
en: a child value, a synonym for a child value
fr: une valeur enfant, un synonyme d'une valeur enfant

<en: value
en: another child value

<en: a child value
<en: another child value
en: a grand-child value

# properties

en: value
fr: valeur
description:en: a property of value
description:fr: french version of the property
country_code:en: a property that is the same for all languages -> use English suffix en:
wikidata:en:Q89
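To make the line types in this format concrete, here is a rough sketch of a parser. It is not the Open Food Facts parser, and it rests on several assumptions: entries are separated by blank lines, lines starting with `#` are comments, language codes are two lowercase letters, and property names use only lowercase letters and underscores. The names `parse_taxonomy`, `NAME_RE` and `PROP_RE` are hypothetical.

```python
import re

NAME_RE = re.compile(r"^([a-z]{2}):\s*(.+)$")             # en: value, a synonym
PROP_RE = re.compile(r"^([a-z_]+):([a-z]{2}):\s*(.+)$")   # wikidata:en:Q89


def parse_taxonomy(text: str) -> dict:
    """Parse taxonomy text into stopwords, generic synonyms and entries."""
    result = {"stopwords": {}, "synonyms": {}, "entries": []}
    for block in re.split(r"\n\s*\n", text.strip()):
        entry = {"parents": [], "names": {}, "properties": {}}
        for line in block.splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            if line.startswith("<"):
                entry["parents"].append(line[1:].strip())
            elif line.startswith(("stopwords:", "synonyms:")):
                kind, lang, words = line.split(":", 2)
                result[kind].setdefault(lang, []).extend(
                    w.strip() for w in words.split(",")
                )
            elif m := NAME_RE.match(line):
                entry["names"][m[1]] = [n.strip() for n in m[2].split(",")]
            elif m := PROP_RE.match(line):
                entry["properties"][f"{m[1]}:{m[2]}"] = m[3]
        if entry["names"]:
            result["entries"].append(entry)
    return result


SAMPLE = """\
# stopwords
stopwords:fr: word,that,are,removed,when,matching

en: value, a synonym value
fr: valeur, une valeur synonyme

<en: value
en: a child value
description:en: a property of a child value
wikidata:en:Q89
"""

parsed = parse_taxonomy(SAMPLE)
print(len(parsed["entries"]))          # 2
print(parsed["entries"][1]["parents"]) # ['en: value']
```

The first name in each language list is treated as the main name and the rest as synonyms, which mirrors the `en: value, a synonym value` line in the example file.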
Languages
Each language has a 2-letter prefix, e.g. "en" for English and "fr" for French.
A value can be defined in another language (which becomes the canonical language), e.g. fr:soupes-a-l-oignon could be the canonical value for "Onion Soups" if we don't have an English translation yet.
New values (e.g. categories that do not exist yet) should have an English canonical value.
When a field value needs to be translated to a target language, if the translation does not exist yet, English is shown (or the canonical language if the English translation does not exist either).
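The fallback order described above (target language, then English, then the canonical language) can be sketched as a small lookup. The function name `display_name` and the dictionary of names per language are assumptions for illustration.

```python
def display_name(names: dict[str, str], target: str, canonical: str) -> str:
    """Pick the target translation, else English, else the canonical language."""
    for lang in (target, "en", canonical):
        if lang in names:
            return names[lang]
    raise KeyError("term has no name in its canonical language")


# A term that only exists in French so far (French is its canonical language)
french_only = {"fr": "Soupes à l'oignon"}
# The same term once an English name has been added
with_english = {"fr": "Soupes à l'oignon", "en": "Onion Soups"}

print(display_name(french_only, "de", "fr"))   # Soupes à l'oignon
print(display_name(with_english, "de", "fr"))  # Onion Soups
print(display_name(with_english, "fr", "fr"))  # Soupes à l'oignon
```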
Ideas
Draft Taxonomies
There are some ideas of creating other taxonomies:
- Global stores taxonomy
- Global Religious Certification taxonomy
- Global Food Preparation taxonomy (related to Project:Microwave)
- Global IGP taxonomy
- Global EC marks taxonomy
TBD: the following texts will be updated and/or placed in separate pages. @aleene 5-10-2022
Singular or plural?
Generally, we use the plural for categories, but some of them are in the singular. We don't use the plural form when it has a different meaning. For example, Beef and Beefs: we are talking about the meat and not the animal, so there is no "s". But "Rillettes" in French (and other languages) doesn't have a singular form.
Sometimes plural or singular depends on the language. The situation for Dutch is described in Dutch translation issues.
If the category is in plural form, the translations should be in the plural form.
@stephane new proposal (not yet adopted, to be discussed):
In the categories taxonomy, always use the plural (en:beers, fr:bières). Then add the singular (especially if it's not a simple rule like removing the final s) in a property:
singular:en:beer
singular:fr:bière
For the ingredients taxonomy, do the reverse: always use the singular for the main entry, and then add the plural:
en:tomato
plural:en:tomatoes
Getting taxonomy files
Finding new opportunities for taxonomization
- A good trick to find candidates for taxonomization is to look at the following pages:
  - https://world.openfoodfacts.org/categories?filter=-
  - https://world.openfoodfacts.org/labels?filter=-
  - https://world.openfoodfacts.org/origins?filter=-
  - https://world.openfoodfacts.org/ingredients?filter=-
- Everything in italics is up for grabs
Taxonomies in JSON format
Every taxonomy is also available as a JSON file on Open Food Facts. Use
https://static.openfoodfacts.org/data/taxonomies/<taxonomy name>.json
or, to get additional properties,
https://static.openfoodfacts.org/data/taxonomies/<taxonomy name>.full.json
Example for categories:
- https://static.openfoodfacts.org/data/taxonomies/categories.json
- https://static.openfoodfacts.org/data/taxonomies/categories.full.json
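A small helper can build these URLs and download a file. This is a sketch using only the Python standard library. It follows the `<taxonomy name>.json` and `<taxonomy name>.full.json` patterns given above; the function names are hypothetical, and nothing is assumed here about the structure of the returned JSON.

```python
import json
from urllib.request import urlopen

BASE_URL = "https://static.openfoodfacts.org/data/taxonomies"


def taxonomy_url(name: str, full: bool = False) -> str:
    """Build the static JSON URL for a taxonomy, optionally the full variant."""
    suffix = ".full.json" if full else ".json"
    return f"{BASE_URL}/{name}{suffix}"


def fetch_taxonomy(name: str, full: bool = False) -> dict:
    """Download and decode a taxonomy file; requires network access."""
    with urlopen(taxonomy_url(name, full)) as response:
        return json.load(response)


print(taxonomy_url("categories"))
print(taxonomy_url("categories", full=True))
```

Calling `fetch_taxonomy("categories")` would then return the decoded JSON document, provided the server is reachable.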
Taxonomy API
We have a basic taxonomy API to get information about a taxonomy.
E.g. to get information about the en:carrots entry in the categories taxonomy:
https://world.openfoodfacts.org/api/v2/taxonomy?tagtype=categories&tags=en:carrots
You can add the lc (language code) parameter to ask for more than one language; cc is used for the country code.
E.g. to add French information to the previous request:
https://world.openfoodfacts.org/api/v2/taxonomy?tagtype=categories&tags=en:carrots&lc=en,fr&cc=fr
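The request URLs above can be assembled programmatically. The sketch below uses only the parameters shown in the examples (`tagtype`, `tags`, `lc`, `cc`); the helper name `taxonomy_api_url` is hypothetical, and passing several tags as a comma-separated list is an assumption, since the examples only show a single tag.

```python
from urllib.parse import urlencode

API_URL = "https://world.openfoodfacts.org/api/v2/taxonomy"


def taxonomy_api_url(tagtype, tags, lc=None, cc=None) -> str:
    """Build a taxonomy API URL from a tag type, tags and optional locale."""
    params = {"tagtype": tagtype, "tags": ",".join(tags)}
    if lc:
        params["lc"] = ",".join(lc)
    if cc:
        params["cc"] = cc
    # Keep ':' and ',' unescaped so the URL reads like the examples in the text
    return f"{API_URL}?{urlencode(params, safe=':,')}"


print(taxonomy_api_url("categories", ["en:carrots"]))
print(taxonomy_api_url("categories", ["en:carrots"], lc=["en", "fr"], cc="fr"))
```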
Building and deploying taxonomies
Changes to taxonomies on GitHub are not deployed instantly: they need to be built and deployed, and products need to be re-processed with the new taxonomy.
More info
For detailed information specific for the ingredients taxonomy see Ingredients Ontology.