Product Opener - How to update a taxonomy

From Open Food Facts wiki

Introduction

Global taxonomies are used in Product Opener to "tag" products. For instance, Open Food Facts uses taxonomies for categories, countries, labels, additives, etc.

The taxonomies are multilingual (each tag value can have translations in different languages) and hierarchical (tags have parents, and applying a tag to a product also applies the parent tags).
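To illustrate what "applying a tag also applies the parent tags" means, here is a minimal Python sketch. It is not the Product Opener implementation (which is written in Perl); the function name, the tag ids and the parent map are hypothetical and only serve the example.

```python
def expand_with_parents(tags, parents):
    """Return the given tags plus all of their direct and indirect parents."""
    expanded = set()
    stack = list(tags)
    while stack:
        tag = stack.pop()
        if tag in expanded:
            continue  # already handled, also protects against cycles
        expanded.add(tag)
        stack.extend(parents.get(tag, ()))
    return expanded


# Hypothetical parent map: tag -> list of direct parents
PARENTS = {
    "en:strawberry-yogurts": ["en:fruit-yogurts"],
    "en:fruit-yogurts": ["en:yogurts"],
    "en:yogurts": ["en:dairies"],
}

product_tags = expand_with_parents(["en:strawberry-yogurts"], PARENTS)
print(sorted(product_tags))
# ['en:dairies', 'en:fruit-yogurts', 'en:strawberry-yogurts', 'en:yogurts']
```

A product tagged only with the most specific category therefore also shows up under every broader category above it.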

Taxonomy files

Taxonomies are kept in the taxonomies directory:

stephane@ks3095298:~/product-opener/taxonomies$ ls
additives.result.sto   categories.result.txt  countries.txt
additives.result.txt   categories.txt         labels.result.sto
additives.txt          countries.result.sto   labels.result.txt
categories.result.sto  countries.result.txt   labels.txt
  • categories.txt is the source definition of the categories taxonomy
  • categories.result.txt is the normalized definition produced by the compilation, in the same format as the source categories.txt
    • entries are normalized (e.g. the en: value is preferred when specifying parents)
    • entries may appear in a different order than in the source
  • categories.result.sto is the resulting binary representation of the taxonomy, in Perl's native format

Taxonomy source format

See Global taxonomies

Taxonomy source files

Open Food Facts taxonomy source files are on the wiki so that contributors can improve them (adding tags, adding translations, reorganizing tags): http://en.wiki.openfoodfacts.org/Global_taxonomies

Compilation

Before taxonomies can be used in Product Opener, they need to be compiled in a binary format. The compilation has two goals:

  1. Generate as many synonyms as possible for each tag
    • This is so that end users can enter any form of a tag and still have it matched to the canonical tag
    • e.g. "Yaourts fraise" and "Yaourt aux fraises"
  2. Compute the hierarchy
    • for each tag, determine which tags are its parents and children

The compilation code is in the build_tags_taxonomy() function of Tags.pm. This function is called by scripts/build_tags_taxonomy.pl.

Compilation steps

The compilation steps are:

  • read translations and synonyms
  • compute synonyms for each tag (using synonyms for some of the words)
  • add more synonyms (remove stopwords, generate simple singular/plural forms)
  • compute the hierarchy
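The effect of the synonym steps above can be sketched in Python. Note that this sketch uses a different technique from the one described: instead of generating every synonym variant, it reduces each name to a single matching key. The stopword list, the naive singular rule and all function names are assumptions made for illustration, not the rules used by Tags.pm.

```python
import unicodedata

# Hypothetical French stopwords; the real lists live in Product Opener.
STOPWORDS = {"fr": {"a", "au", "aux", "la", "le", "les", "de", "du", "des"}}


def strip_accents(text):
    """Remove diacritics so that accented and unaccented forms compare equal."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def singular(word):
    """Very naive singular form: drop one trailing 's' from longer words."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def match_key(lang, name):
    """Lowercase, strip accents, drop stopwords and singularize each word."""
    words = strip_accents(name.lower()).split()
    stopwords = STOPWORDS.get(lang, set())
    return "-".join(singular(w) for w in words if w not in stopwords)


print(match_key("fr", "Yaourts fraise"))      # yaourt-fraise
print(match_key("fr", "Yaourt aux fraises"))  # yaourt-fraise
```

Both spellings end up with the same key, which is the outcome the compilation aims for when it generates synonyms. It also shows the risk: an overly broad stopword list or plural rule makes unrelated names collapse together.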

Note: if the taxonomy is big and/or contains many synonyms

Compilation script

Testing

stephane@ks3095298:~/product-opener/scripts$ ./build_tags_taxonomy.pl categories

This compiles the categories taxonomy but generates only categories.result.txt from categories.txt; the binary categories.result.sto is not written.

Review the resulting categories.result.txt file to make sure that no unwanted synonyms or relations were created.

Empty names:

An empty name " : " in the result indicates an error in the taxonomy source file. Usually an entry that does not exist was given as a parent.

< :
< en:Wheat flours
en:Wheat malt flours
es:Harinas de trigo malteadas, Harinas de malta de trigo
fr:Farines de malt de blé

Check the source file to see which entry is wrong:

<en:Wheat flours
<fr:Malt flours
en:Wheat malt flours
es:Harinas de trigo malteadas, Harinas de malta de trigo
fr:Farines de malt de blé

Here "fr:Malt flours" does not exist; the parent should be "en:Malt flours".

Note: the build_tags_taxonomy.pl script has been updated to report those errors.
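For reviewing a source file by hand, the same kind of check can be sketched in a few lines of Python. This is not the code of build_tags_taxonomy.pl. It assumes, based on the example above, that entries are separated by blank lines, that parent lines start with "<", and that names are comma-separated after a language prefix. The function names and the two parent entries in the sample are hypothetical.

```python
import re


def parse_entries(text):
    """Split taxonomy source text into entries separated by blank lines."""
    entries = []
    for block in re.split(r"\n\s*\n", text.strip()):
        entry = {"parents": [], "names": {}}
        for line in block.splitlines():
            line = line.strip()
            if not line:
                continue
            if line.startswith("<"):
                # Parent reference, e.g. "<en:Wheat flours"
                entry["parents"].append(line[1:].strip())
            else:
                # Names and synonyms, e.g. "es:Name, Synonym"
                lang, _, names = line.partition(":")
                entry["names"][lang] = [n.strip() for n in names.split(",")]
        entries.append(entry)
    return entries


def find_missing_parents(entries):
    """Return parent references that match no name in the given language."""
    known = {
        (lang, name.lower())
        for entry in entries
        for lang, names in entry["names"].items()
        for name in names
    }
    missing = []
    for entry in entries:
        for parent in entry["parents"]:
            lang, _, name = parent.partition(":")
            if (lang, name.strip().lower()) not in known:
                missing.append(parent)
    return missing


SOURCE = """
en:Wheat flours

en:Malt flours

<en:Wheat flours
<fr:Malt flours
en:Wheat malt flours
es:Harinas de trigo malteadas, Harinas de malta de trigo
fr:Farines de malt de blé
"""

print(find_missing_parents(parse_entries(SOURCE)))
# ['fr:Malt flours']
```

The lookup is per language, which is why "fr:Malt flours" is reported even though an "en:Malt flours" entry exists.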

Publishing

stephane@ks3095298:~/product-opener/scripts$ ./build_tags_taxonomy.pl categories publish

This also writes categories.result.sto, the binary file used by Product Opener.

Test

Taxonomies are very powerful tools, but with great power comes great risk. The synonym generation rules (word synonyms, stopwords, etc.) can create unwanted synonyms.

Testing tools

We need testing tools. Please create some. :-)
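As a starting point, one possible check is to look for generated synonyms that end up attached to more than one canonical tag. The Python sketch below assumes the generated synonyms are already available as a mapping from canonical tag to synonym list. That mapping, the tag ids and the function name are hypothetical, and the sample data is made up to show a collision.

```python
from collections import defaultdict


def find_collisions(synonyms_by_tag):
    """Return synonyms claimed by more than one canonical tag."""
    owners = defaultdict(set)
    for tag, synonyms in synonyms_by_tag.items():
        for synonym in synonyms:
            owners[synonym].add(tag)
    return {s: sorted(tags) for s, tags in owners.items() if len(tags) > 1}


# Made-up data: imagine "malt" was wrongly treated as a stopword.
GENERATED = {
    "en:wheat-flours": ["wheat-flour", "flour-wheat"],
    "en:wheat-malt-flours": ["wheat-malt-flour", "wheat-flour"],
}

print(find_collisions(GENERATED))
# {'wheat-flour': ['en:wheat-flours', 'en:wheat-malt-flours']}
```

Running such a check before publishing would flag synonyms that could match a product to the wrong tag.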

Restart Apache


After restarting Apache, editing a product will recompute all tags using the newest published taxonomies.


Updating products

The scripts directory contains scripts that will recompute tags for all products in the database.

stephane@ks3095298:~/product-opener/scripts$ ./update_all_products_categories.pl