Global taxonomies

From Open Food Facts wiki
Remarks

  • Which standard is used for the language codes? It can be the ISO 639-1 standard (https://en.wikipedia.org/wiki/ISO_639-1); eventually this could be extended to the 3-letter codes.

Revision as of 12:09, 6 October 2022

Introduction

Open Food Facts uses multiple taxonomies for its inner workings. These taxonomies range from simple word lists with their translations to complex hierarchies with relations between entries and attributes. The taxonomies are used to analyse and structure the information found on products.

This page gives an overview of the taxonomies and some guiding principles that are valid for all taxonomies.

Overview

The list of taxonomies in use is:

  • Additives classes taxonomy
  • Brands taxonomy for the producers of products and their corresponding brands. This taxonomy is not in use and still under construction (5-oct-2022);
  • Categories taxonomy used to put similar products together, so that they can be compared, averages calculated and analysed as a group. For instance an orange juice of a specific brand can be compared to all orange juices;
  • Countries taxonomy has a list of all countries, regions, areas, etc. in the world;
  • Ingredients taxonomy for ingredients found on product ingredients lists. Some specific ingredients have been put into separate taxonomies, as these need special handling due to legal requirements:
    • Additives taxonomy
    • Allergens taxonomy
    • Amino acids taxonomy
    • Minerals taxonomy
    • Nucleotides taxonomy
    • Other nutritional substances taxonomy
    • Vitamins taxonomy
  • Labels taxonomy for every logo and claim by producer about product quality, supply chain, sourcing, diets, etc.;
  • Languages taxonomy with a list of world languages in multiple languages;
  • NOVA groups taxonomy
  • Packaging/recycling taxonomies have been spread out over several specific taxonomies:
    • Materials taxonomy with a list of all possible materials that can be used;
    • Shapes taxonomy which describes all the parts a packaging can consist of;
    • Recycling taxonomy with all the ways packaging can be recycled;
  • States taxonomy which describes how complete the data of a product on OFF is.

Presentations

There are several other presentations and descriptions of taxonomies, which you might want to look at before diving into the details:

Principles

There are some general principles that are valid for all taxonomies. A taxonomy consists of terms and stopwords.

Stopwords

Stopwords are words that can be ignored. These words can be found among ingredients, for instance, but are not ingredients themselves. Each language can have its own stopwords.

Generic synonyms

What is the role of these in parsing?

Term Section

The main part of a taxonomy is a (very) large list of term sections. A term section consists of a list of terms in different languages and a list of properties.

Term

A term can be a specific ingredient, label, language, etc, depending on the taxonomy.

A term consists of two parts: a language code (see the Languages taxonomy) and one or more values. The language code determines the language of the values that follow. Thus one can find the value Authentic Trappist Product in the English language in the Labels taxonomy.

If a term can be translated to another language, a new term can be added for that language. For instance for the English name Banana you might want to add the name in Afrikaans (Piesang) or Amharic (ሙዝ).

Canonical term/value

Each term section has a primary language whose term is used to identify the term section. This is the canonical name for the term section. Preferably this is the English term.

Synonyms

For each term it is possible to add synonyms. These will be used during ingredient recognition, for instance. However, only the main term name will be used. For instance, the Spanish translation for Banana is Plátano, and the synonym Banana has been added.

Properties

Implementation

TBD: the following texts will be updated and/or placed in separate pages. @aleene 5-10-2022


Languages

Each language has a 2-letter prefix, e.g. "en" for English and "fr" for French.

A value can be defined in another language (which becomes the canonical language), e.g. fr:soupes-a-l-oignon could be the canonical value for "Onion Soups" if we don't have an English translation yet.
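
The relation between a display value and a prefixed identifier such as fr:soupes-a-l-oignon can be sketched as below. This is only an illustration under assumptions: the function name to_tag_id is hypothetical, and the normalization steps (lowercasing, removing accents, replacing other characters with hyphens) are inferred from the single example above, not taken from the Open Food Facts code. Real rules may differ per language, and this sketch only handles Latin scripts.

```python
import re
import unicodedata


def to_tag_id(language_code: str, value: str) -> str:
    """Build a hypothetical 'lc:normalized-value' identifier from a display value."""
    # Decompose accented characters, then drop the combining marks (à -> a).
    decomposed = unicodedata.normalize("NFKD", value.lower())
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Collapse every run of characters other than a-z and 0-9 into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", unaccented).strip("-")
    return f"{language_code}:{slug}"


print(to_tag_id("fr", "Soupes à l'oignon"))  # fr:soupes-a-l-oignon
```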

New values (e.g. categories that do not exist yet) should have an English canonical value.

Each field value can be translated to any language.

When a field value needs to be translated to a target language, if the translation does not exist yet, English is shown (or the canonical language if the English translation does not exist either).
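
The fallback order described above (target language, then English, then the canonical language) can be sketched as follows. The function name display_name and the dictionary layout are assumptions made for this illustration; they do not describe how the Open Food Facts server stores translations.

```python
from typing import Dict, Optional


def display_name(translations: Dict[str, str], canonical_lc: str, target_lc: str) -> Optional[str]:
    """Pick the name to show: target language, else English, else canonical language."""
    for lc in (target_lc, "en", canonical_lc):
        if lc in translations:
            return translations[lc]
    return None


banana = {"en": "Banana", "af": "Piesang"}
onion_soups = {"fr": "Soupes à l'oignon"}  # no English translation yet

print(display_name(banana, "en", "af"))       # Piesang
print(display_name(banana, "en", "de"))       # Banana
print(display_name(onion_soups, "fr", "de"))  # Soupes à l'oignon
```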

Singular or plural?

Generally, we use the plural for categories, but some of them are in the singular. We don't use the plural form when it has a different meaning. For example, we use Beef rather than Beefs: we are talking about the meat and not the animal, so there is no "s". Conversely, "Rillettes" in French (and other languages) doesn't have a singular form.

Sometimes plural or singular depends on the language. The situation for Dutch is described in Dutch translation issues.

If the category is in plural form, the translations should be in the plural form.

@stephane new proposal (not yet adopted, to be discussed)
In the categories taxonomy, always use the plural (en:beers, fr:bières).
Then add the singular (especially if it's not a simple rule like removing the final s) in a property:
singular:en:beer
singular:fr:bière
For the ingredients taxonomy, do the reverse:
always use the singular for the main entry, and then add the plural:
en:tomato
plural:en:tomatoes

Synonyms

In each language, each value can have a number of synonyms.

Simple synonyms (simple singular) are done automatically when possible.

Synonyms are recursive: if en:yoghurt is a synonym of en:yogurt, then en:banana_yoghurt will automatically be added as a synonym of en:banana_yogurt.
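
A minimal sketch of this propagation, assuming entries are written as lc:word_word and that word-level synonyms are available as a dictionary. The function name expand_synonyms and the data layout are hypothetical; the actual build process may work differently.

```python
from typing import Dict, List


def expand_synonyms(tag: str, word_synonyms: Dict[str, List[str]]) -> List[str]:
    """Return extra spellings of `tag` obtained by swapping in word-level synonyms."""
    lc, _, name = tag.partition(":")
    words = name.split("_")
    variants = []
    for i, word in enumerate(words):
        for alternative in word_synonyms.get(word, []):
            swapped = words[:i] + [alternative] + words[i + 1:]
            variants.append(f"{lc}:" + "_".join(swapped))
    return variants


print(expand_synonyms("en:banana_yogurt", {"yogurt": ["yoghurt"]}))
```

This sketch swaps words literally, which is why it would not cope with languages where an adjective has to agree with the gender of the replaced word.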

Remarks

  • What are the simple synonym rules? How does this translate to other languages?
  • Note that recursion does not work for languages where the adjectives change based on gender.

Stopwords

Stopwords can be used to further extend synonyms. e.g. if "à" and "la" are stopwords for French, then "Yaourts fraise" will automatically be mapped to "Yaourts à la fraise".

In the ingredients taxonomy, stopwords are also words that can be ignored. For instance, "contains" is not an ingredient.
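
The stopword matching described above can be illustrated by comparing texts after their stopwords have been removed. This is a simplified sketch: the function name matching_key is hypothetical, and splitting on whitespace is an assumption (it would not handle elisions such as "l'oignon").

```python
from typing import Set


def matching_key(text: str, stopwords: Set[str]) -> str:
    """Lowercase the text and drop stopwords so that equivalent names compare equal."""
    words = text.lower().split()
    return " ".join(word for word in words if word not in stopwords)


FRENCH_STOPWORDS = {"à", "la"}

# Index the known value by its key, then look up the shorter user input.
known = {matching_key("Yaourts à la fraise", FRENCH_STOPWORDS): "Yaourts à la fraise"}
print(known.get(matching_key("Yaourts fraise", FRENCH_STOPWORDS)))  # Yaourts à la fraise
```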

Description

The description: prefix is used to describe the term. Example:

en:Fair trade
description:en:Fair trade is an arrangement designed to help producers in developing countries achieve sustainable and equitable trade relationships. Members of the fair trade movement add the payment of higher prices to exporters, as well as improved social and environmental standards. 

The description is reused on the Open Food Facts website. Example: https://world.openfoodfacts.org/label/fair-trade

Image (logo)

The image: prefix is used to add an image/logo related to the term. Example:

<en:Organic
en:Bio Austria
de:Bio Austria
country:en:Austria
image:en:bio-austria.67x90.svg

The image is reused on the Open Food Facts website. Example: https://world.openfoodfacts.org/label/bio-austria

Wikidata

The wikidata: prefix is used to link the term with its equivalent in the Wikidata database. Example:

fr:Label Rouge
xx:Label Rouge
wikidata:en:Q3214309

The description is reused on the Open Food Facts website. Example: https://world.openfoodfacts.org/label/label-rouge

Opposite

We use "opposite" for imports (e.g. when there is a column "Organic" with values like "No"). Example:

en:Non-fair trade, Not fair trade
fr:Non issu du commerce équitable
opposite:en: en:fair-trade

Wikipedia

The wikipedia: prefix allows to link the term with its equivalent on Wikipedia. Example:

<en:Fair trade
en:Fairtrade USA
xx:Fairtrade USA
wikipedia:en:https://en.wikipedia.org/wiki/Fair_Trade_USA

The wikipedia: property is not reused on the Open Food Facts website.

Unused properties

Some people have added more properties in the different taxonomies. They are not used for the moment, but they can lead to interesting usages:

  • give more information for the contributor who is editing the taxonomy
  • prepare new usages or features

Here is a list of properties already used in the taxonomies:

  • wikipedia:
  • country:
  • label_categories:
  • eu_groups:
  • auth_name:
  • auth_address:
  • auth_url:
  • exceptions:

Taxonomy architecture

The taxonomy is not a strict hierarchy: values can have multiple parents. But cycles are not allowed.
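
Since values can have multiple parents but cycles are not allowed, the parent links form a directed acyclic graph. The sketch below shows one way such a constraint could be checked, using a depth-first search over a child-to-parents mapping. The function name find_cycle and the data layout are assumptions for illustration; this is not the check used by the Open Food Facts build.

```python
from typing import Dict, List

UNVISITED, IN_PROGRESS, DONE = 0, 1, 2


def find_cycle(parents: Dict[str, List[str]]) -> List[str]:
    """Return one cycle as a list of entries, or an empty list if there is none."""
    state: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> List[str]:
        state[node] = IN_PROGRESS
        path.append(node)
        for parent in parents.get(node, []):
            if state.get(parent, UNVISITED) == IN_PROGRESS:
                # We walked back onto the current path: that is a cycle.
                return path[path.index(parent):] + [parent]
            if state.get(parent, UNVISITED) == UNVISITED:
                cycle = visit(parent)
                if cycle:
                    return cycle
        path.pop()
        state[node] = DONE
        return []

    for node in list(parents):
        if state.get(node, UNVISITED) == UNVISITED:
            cycle = visit(node)
            if cycle:
                return cycle
    return []


valid = {
    "en:a-grand-child-value": ["en:a-child-value", "en:another-child-value"],
    "en:a-child-value": ["en:value"],
    "en:another-child-value": ["en:value"],
}
print(find_cycle(valid))                                   # []
print(find_cycle({"en:a": ["en:b"], "en:b": ["en:a"]}))    # ['en:a', 'en:b', 'en:a']
```

The recursive search is kept for readability; a very deep taxonomy would need an iterative version to stay within Python's recursion limit.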

Finding new opportunities for taxonomization

A good trick to find candidates for taxonomization is to look at the following pages:

  • https://world.openfoodfacts.org/categories?filter=-
  • https://world.openfoodfacts.org/labels?filter=-
  • https://world.openfoodfacts.org/origins?filter=-
  • https://world.openfoodfacts.org/ingredients?filter=-

Everything shown in italics on these pages is up for grabs.

Format

# stopwords
stopwords:en: some,stopwords
stopwords:fr: word,that,are,removed,when,matching

# synonyms that are not field values but that are contained in field values
synonyms:en: global,international

en: value, a synonym value, another synonym value
fr: valeur, une valeur synonyme, une autre valeur synonyme

<en: value
en: a child value, a synonym for a child value
fr: une valeur enfant, un synonyme d'une valeur enfant

<en: value
en: another child value

<en: a child value
<en: another child value
en: a grand-child value

# properties
en: value
fr: valeur
description:en: a property of value
description:fr: french version of the property
country_code:en: a property that is the same for all languages -> use English suffix en:
wikidata:en:Q89
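
To make the format above more concrete, here is a small parser sketch. It assumes that entries are separated by blank lines, that lines starting with "<" name a parent, that lines of the form name:lc:value are properties, and that the remaining lc: lines list a term with its comma-separated synonyms. The function name parse_taxonomy and the output layout are hypothetical, and global stopwords: and synonyms: lines are simply skipped. It is not the parser used by Open Food Facts.

```python
import re
from typing import Dict, List

PARENT = re.compile(r"^<\s*(\w+):\s*(.+)$")
PROPERTY = re.compile(r"^(\w+):(\w+):\s*(.+)$")
TERM = re.compile(r"^(\w+):\s*(.+)$")
GLOBAL_PREFIXES = ("stopwords:", "synonyms:")


def parse_taxonomy(text: str) -> List[Dict]:
    """Parse blank-line separated entries into parents, terms and properties."""
    entries = []
    for block in re.split(r"\n\s*\n", text.strip()):
        entry = {"parents": [], "terms": {}, "properties": {}}
        for raw in block.splitlines():
            line = raw.strip()
            if not line or line.startswith("#") or line.startswith(GLOBAL_PREFIXES):
                continue
            parent = PARENT.match(line)
            prop = PROPERTY.match(line)
            term = TERM.match(line)
            if parent:
                entry["parents"].append(f"{parent[1]}:{parent[2].strip()}")
            elif prop:
                entry["properties"][f"{prop[1]}:{prop[2]}"] = prop[3].strip()
            elif term:
                entry["terms"][term[1]] = [v.strip() for v in term[2].split(",")]
        if entry["terms"]:
            entries.append(entry)
    return entries


SAMPLE = """\
# stopwords
stopwords:fr: word,that,are,removed,when,matching

en: value, a synonym value
fr: valeur, une valeur synonyme

<en: value
en: a child value
description:en: a property of a child value
wikidata:en:Q89
"""

for entry in parse_taxonomy(SAMPLE):
    print(entry)
```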

Getting taxonomies files

The definitions can be edited on GitHub; they are periodically synchronized with the Open Food Facts database and website.

Raw taxonomies

(on GitHub; an account and VCS knowledge are needed)

Taxonomies in JSON format

Every taxonomy is also available as JSON on Open Food Facts. Use:

  • https://static.openfoodfacts.org/data/taxonomies/<taxonomy name>.json
  • or https://static.openfoodfacts.org/data/taxonomies/<taxonomy name>.full.json to get additional properties
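
As an illustration, the two URL patterns above could be used from Python as follows. The helper names taxonomy_url and load_taxonomy are hypothetical, and no assumption is made here about the structure of the returned JSON.

```python
import json
from urllib.request import urlopen

BASE_URL = "https://static.openfoodfacts.org/data/taxonomies"


def taxonomy_url(name: str, full: bool = False) -> str:
    """Build the URL of a taxonomy export; `full` selects the variant with additional properties."""
    suffix = ".full.json" if full else ".json"
    return f"{BASE_URL}/{name}{suffix}"


def load_taxonomy(name: str, full: bool = False):
    """Download and decode one taxonomy (performs a network request)."""
    with urlopen(taxonomy_url(name, full), timeout=30) as response:
        return json.load(response)


print(taxonomy_url("categories"))
print(taxonomy_url("categories", full=True))
```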

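Example for categories:

  • https://static.openfoodfacts.org/data/taxonomies/categories.json
  • https://static.openfoodfacts.org/data/taxonomies/categories.full.json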

Taxonomy API

We have a basic taxonomy API to get information about a taxonomy.

E.g. to get information about the en:carrots entry in the categories taxonomy:

https://world.openfoodfacts.org/api/v2/taxonomy?tagtype=categories&tags=en:carrots

You can add the lc (language code) parameter to ask for more than one language; cc is used for the country code.

E.g. to add French information to the previous request:

https://world.openfoodfacts.org/api/v2/taxonomy?tagtype=categories&tags=en:carrots&lc=en,fr&cc=fr
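
The request URLs above can also be assembled programmatically. The sketch below only builds the URL; the function name taxonomy_api_url is hypothetical, and passing several tags as a comma-separated list is an assumption modelled on how lc is written in the example above.

```python
from typing import List, Optional
from urllib.parse import urlencode

API_ENDPOINT = "https://world.openfoodfacts.org/api/v2/taxonomy"


def taxonomy_api_url(
    tagtype: str,
    tags: List[str],
    lc: Optional[List[str]] = None,
    cc: Optional[str] = None,
) -> str:
    """Build a taxonomy API URL from the parameters shown in the examples."""
    params = {"tagtype": tagtype, "tags": ",".join(tags)}
    if lc:
        params["lc"] = ",".join(lc)
    if cc:
        params["cc"] = cc
    # Keep ':' and ',' unescaped so the result matches the URLs shown above.
    return f"{API_ENDPOINT}?{urlencode(params, safe=':,')}"


print(taxonomy_api_url("categories", ["en:carrots"]))
print(taxonomy_api_url("categories", ["en:carrots"], lc=["en", "fr"], cc="fr"))
```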

Draft Taxonomies

Building and deploying taxonomies

Changes to taxonomies on GitHub are not deployed instantly: they need to be built and deployed, and products need to be re-processed with the new taxonomy.

More info

For detailed information specific to the ingredients taxonomy, see Ingredients Ontology.