Global taxonomies

From Open Food Facts wiki

Revision as of 15:39, 11 June 2020


Open Food Facts uses global taxonomies for fields such as categories, brands, labels and countries. This page explains how taxonomies work in Open Food Facts and how they can be updated and enhanced.

Features

  • A global hierarchy / taxonomy for each type of data field (categories, brands, labels, countries etc.)
  • Translations of each field value into every language
  • Multiple synonyms for each field value in each language
  • Stopwords for each language/field type

Generalities

Languages

Each language has a 2-letter prefix, e.g. "en" for English and "fr" for French.

Whenever possible, the canonical language for each field value should be English. For example, en:soups is the canonical value for the Soups category. A value can also be defined in another language (which then becomes its canonical language), e.g. fr:soupes-a-l-oignon could be the canonical value for "Onion Soups" if we don't have an English translation yet.

New values (e.g. categories that do not exist yet) should have an English canonical value.

Each field value can be translated to any language.

When a field value needs to be translated to a target language, if the translation does not exist yet, English is shown (or the canonical language if the English translation does not exist either).
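The fallback order described above can be sketched in Python. This is only an illustration: the function name display_name and the dictionary layout are assumptions, not the actual Open Food Facts implementation.

```python
def display_name(translations, canonical_id, target_lang):
    """Pick the name to show for a field value in target_lang.

    translations: hypothetical dict mapping language code -> translated name.
    canonical_id: canonical value such as "en:soups".
    """
    # The canonical language is the prefix of the canonical value.
    canonical_lang = canonical_id.split(":", 1)[0]
    # Try the target language, then English, then the canonical language.
    for lang in (target_lang, "en", canonical_lang):
        if lang in translations:
            return translations[lang]
    # Nothing usable: fall back to the canonical identifier itself.
    return canonical_id


soups = {"en": "Soups", "fr": "Soupes"}
onion_soups = {"fr": "Soupes à l'oignon"}  # no English translation yet

print(display_name(soups, "en:soups", "de"))
print(display_name(onion_soups, "fr:soupes-a-l-oignon", "de"))
```

With no German translation available, the first call falls back to English and the second to the canonical French name.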

Remarks

  • Which standard is used for the codes? It can be the ISO 639-1 standard; eventually this could be extended to the 3-letter codes.

Singular or plural?

Generally, we use the plural for categories, but some of them are in the singular. We don't use the plural form when it has a different meaning. For example, we use "Beef" and not "Beefs": we are talking about the meat and not the animal, so there is no "s". Conversely, "Rillettes" in French (and other languages) doesn't have a singular form.

Sometimes plural or singular depends on the language. The situation for Dutch is described in Dutch translation issues.

If the category is in plural form, the translations should be in the plural form.

@stephane new proposal (not yet adopted, to be discussed)
In the categories taxonomy, always use the plural (en:beers, fr:bières).
Then add the singular (especially if it's not a simple rule like removing the final s) in a property:
singular:en:beer
singular:fr:bière
For the ingredients taxonomy, do the reverse:
always use the singular for the main entry, and then add the plural:
en:tomato
plural:en:tomatoes

Synonyms

In each language, each value can have a number of synonyms.

Simple synonyms (e.g. simple singular forms) are generated automatically when possible.

Synonyms are recursive: if en:yoghurt is a synonym of en:yogurt, then en:banana_yoghurt will automatically be added as a synonym of en:banana_yogurt.
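A minimal Python sketch of this recursive behaviour, under simplifying assumptions: the helper derived_synonyms and the word_synonyms mapping are hypothetical, and only one word is substituted at a time. The real matching rules may differ.

```python
def derived_synonyms(value, word_synonyms):
    """Return variants of value where one word is replaced by a synonym.

    value: an entry such as "en:banana_yogurt".
    word_synonyms: hypothetical dict, e.g. {"en:yogurt": ["yoghurt"]}.
    """
    lang, _, name = value.partition(":")
    words = name.split("_")
    variants = []
    for i, word in enumerate(words):
        # Look up synonyms of this single word in the same language.
        for alt in word_synonyms.get(f"{lang}:{word}", []):
            replaced = words[:i] + [alt] + words[i + 1:]
            variants.append(f"{lang}:" + "_".join(replaced))
    return variants


word_synonyms = {"en:yogurt": ["yoghurt"]}
print(derived_synonyms("en:banana_yogurt", word_synonyms))
```

The call prints a list containing en:banana_yoghurt, the derived synonym from the example above.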

Remarks

  • What are the simple synonym rules? How do they translate to other languages?
  • Note that recursion does not work for languages where adjectives change based on gender.

Stopwords

Stopwords can be used to further extend synonyms. For example, if "à" and "la" are stopwords for French, then "Yaourts fraise" will automatically be mapped to "Yaourts à la fraise".

In the ingredients taxonomy, stopwords are also words that can be ignored. For instance, "contains" is not an ingredient.
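One way to picture stopword matching is to compare values after removing the stopwords. The sketch below is an assumption about how such matching could work: STOPWORDS and match_key are illustrative names, not part of the Open Food Facts code base.

```python
# Hypothetical stopword table, taken from the French example above.
STOPWORDS = {"fr": {"à", "la"}}


def match_key(text, lang):
    """Build a comparison key: lowercase words with stopwords removed."""
    ignored = STOPWORDS.get(lang, set())
    words = text.lower().split()
    return " ".join(word for word in words if word not in ignored)


# Both spellings reduce to the same key, so they can be matched together.
print(match_key("Yaourts à la fraise", "fr"))
print(match_key("Yaourts fraise", "fr"))
```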

Taxonomy architecture

The taxonomy is not a strict hierarchy: values can have multiple parents, but cycles are not allowed.
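The no-cycles rule can be checked with a depth-first search over the parent links. The following Python sketch assumes the taxonomy has already been loaded into a dict mapping each value to its list of parents; find_cycle is a hypothetical helper, not the actual validation code.

```python
def find_cycle(parents):
    """Return one cycle as a list of values, or None if there is none.

    parents: hypothetical dict mapping a value to the list of its parents.
    """
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {}

    def visit(node, path):
        state[node] = IN_PROGRESS
        path.append(node)
        for parent in parents.get(node, []):
            parent_state = state.get(parent, UNSEEN)
            if parent_state == IN_PROGRESS:
                # We came back to a value on the current path: a cycle.
                return path[path.index(parent):] + [parent]
            if parent_state == UNSEEN:
                found = visit(parent, path)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for node in list(parents):
        if state.get(node, UNSEEN) == UNSEEN:
            cycle = visit(node, [])
            if cycle:
                return cycle
    return None


# Multiple parents are fine as long as no value is its own ancestor.
valid = {
    "en:a-grand-child-value": ["en:a-child-value", "en:another-child-value"],
    "en:a-child-value": ["en:value"],
    "en:another-child-value": ["en:value"],
}
invalid = {"en:a": ["en:b"], "en:b": ["en:a"]}

print(find_cycle(valid))
print(find_cycle(invalid))
```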

Format

# stopwords
stopwords:en: some,stopwords
stopwords:fr: word,that,are,removed,when,matching

# synonyms that are not field values but that are contained in field values
synonyms:en: global,international

en: value, a synonym value, another synonym value
fr: valeur, une valeur synonyme, une autre valeur synonyme

<en: value
en: a child value, a synonym for a child value
fr: une valeur enfant, un synonyme d'une valeur enfant

<en: value
en: another child value

<en: a child value
<en: another child value
en: a grand-child value

# properties
en: value
fr: valeur
description:en: a property of value
description:fr: french version of the property
country_code:en: a property that is the same for all languages -> use English suffix en:
wikidata:en:Q89
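To make the format concrete, here is a rough Python sketch of a reader for it. It is based only on the example above: it assumes that blank lines separate entries, that "#" starts a comment, and that names are kept as written. The real parser in openfoodfacts-server is more complete, and every name below (parse_taxonomy, split_list, SAMPLE) is illustrative.

```python
import re

PARENT = re.compile(r"^<([a-z]{2}):\s*(.+)$")               # <en: value
NAMES = re.compile(r"^([a-z]{2}):\s*(.+)$")                 # en: value, synonym
PROPERTY = re.compile(r"^([a-z_]+):([a-z]{2}):\s*(.+)$")    # wikidata:en:Q89


def split_list(text):
    """Split a comma-separated list and trim each item."""
    return [item.strip() for item in text.split(",") if item.strip()]


def parse_taxonomy(text):
    stopwords, synonyms, entries = {}, {}, []
    # Assumption: entries are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        entry = {"parents": [], "names": {}, "properties": {}}
        for line in block.splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip comments
            if m := PARENT.match(line):
                entry["parents"].append(f"{m[1]}:{m[2]}")
            elif m := NAMES.match(line):
                entry["names"][m[1]] = split_list(m[2])
            elif m := PROPERTY.match(line):
                prop, lang, value = m.groups()
                if prop == "stopwords":
                    stopwords[lang] = split_list(value)
                elif prop == "synonyms":
                    synonyms.setdefault(lang, []).append(split_list(value))
                else:
                    entry["properties"][f"{prop}:{lang}"] = value
        if entry["names"]:
            entries.append(entry)
    return stopwords, synonyms, entries


SAMPLE = """\
stopwords:fr: à,la

en: value, a synonym value
fr: valeur

<en: value
en: a child value
wikidata:en:Q89
"""

stopwords, synonyms, entries = parse_taxonomy(SAMPLE)
print(entries[1]["parents"], entries[1]["properties"])
```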

Taxonomies

The definitions can be edited on GitHub; they are periodically synchronized to the Open Food Facts database and website.

Taxonomies

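(on GitHub; an account and VCS knowledge are needed)

  • Test taxonomy (https://github.com/openfoodfacts/openfoodfacts-server/blob/master/taxonomies/test.txt), showing the basic taxonomy definition features
  • Ingredients taxonomy (https://github.com/openfoodfacts/openfoodfacts-server/blob/master/taxonomies/ingredients.txt)
  • Global categories taxonomy (https://github.com/openfoodfacts/openfoodfacts-server/blob/master/taxonomies/categories.txt)
  • Global brands and companies taxonomy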

Draft Taxonomies

More info

For detailed information specific to the ingredients taxonomy, see Ingredients Ontology.