Product Opener - How to update a taxonomy
Introduction
Global taxonomies are used in Product Opener to "tag" products. For instance, Open Food Facts uses taxonomies for categories, countries, labels, additives, etc.
The taxonomies are multilingual (each tag value can have translations in different languages) and hierarchical (tags have parents, and applying a tag to a product also applies the parent tags).
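To illustrate the hierarchical behaviour, here is a minimal Python sketch of how applying one tag can pull in all of its parents. It is not Product Opener's code (which is written in Perl); the PARENTS table, the tag identifiers and the with_ancestors() helper are made up for the example.

```python
# Illustrative sketch only, not Product Opener code.
# PARENTS and the tag identifiers below are hypothetical example data.
PARENTS = {
    "en:strawberry-yogurts": ["en:fruit-yogurts"],
    "en:fruit-yogurts": ["en:yogurts"],
    "en:yogurts": ["en:dairies"],
    "en:dairies": [],
}


def with_ancestors(tag, parents=PARENTS):
    """Return the tag followed by every direct and indirect parent, without duplicates."""
    seen = []
    stack = [tag]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already collected through another path
        seen.append(current)
        stack.extend(parents.get(current, []))
    return seen


# Tagging a product with the most specific category also applies its parents.
print(with_ancestors("en:strawberry-yogurts"))
```

A tag with several parents is handled the same way: each parent is pushed on the stack and visited once.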
Taxonomy files
Taxonomies are kept in the taxonomies directory:
stephane@ks3095298:~/product-opener/taxonomies$ ls
additives.result.sto   categories.result.txt  countries.txt
additives.result.txt   categories.txt         labels.result.sto
additives.txt          countries.result.sto   labels.result.txt
categories.result.sto  countries.result.txt   labels.txt
- categories.txt is the source definition of the categories taxonomy
- categories.result.txt is the normalized definition produced by the compilation, in the same format as the source categories.txt
  - entries are normalized (e.g. the en:value is preferred to specify parents)
  - entries can be sorted in a different order
- categories.result.sto is the resulting binary representation of the taxonomy, in Perl's native format
Taxonomy source format
Taxonomy source files
Open Food Facts taxonomy source files are on the wiki so that contributors can improve them (adding tags, adding translations, reorganizing tags): http://en.wiki.openfoodfacts.org/Global_taxonomies
Compilation
Before taxonomies can be used in Product Opener, they need to be compiled into a binary format. The compilation has two goals:
- Generate as many synonyms as possible for each tag
  - This is so that end users can enter any form of a tag and still have it matched to the canonical tag
  - e.g. "Yaourts fraise" and "Yaourt aux fraises"
- Compute the hierarchy
  - for each tag, determine which tags are its parents and children
The compilation code is in the build_tags_taxonomy() function of Tags.pm. This function is called by scripts/build_tags_taxonomy.pl.
Compilation steps
The compilation steps are:
- read translations and synonyms
- compute synonyms for each tag (using synonyms for some of the words)
- add more synonyms (remove stopwords, generate simple singular/plural forms)
- compute the hierarchy
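The stopword and singular/plural steps can be pictured with a small Python sketch. It is an illustration under stated assumptions, not the logic of build_tags_taxonomy(): instead of generating every synonym, it reduces each name to a normalized key so that two spellings can be compared. The STOPWORDS list and the naive "drop a trailing s" rule are assumptions made for the example.

```python
import unicodedata

# Hypothetical French stopword list, for illustration only.
STOPWORDS = {"a", "au", "aux", "de", "des", "et", "la", "le", "les"}


def normalize(name, stopwords=STOPWORDS):
    """Reduce a tag name to a comparison key: lowercase, no accents,
    no stopwords, and a naive singular form for each word."""
    text = unicodedata.normalize("NFD", name.lower())
    # Drop combining marks so that accented and unaccented letters compare equal.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    words = []
    for word in text.replace("-", " ").split():
        if word in stopwords:
            continue
        # Naive plural handling (assumption): strip one trailing "s" on longer words.
        if len(word) > 3 and word.endswith("s"):
            word = word[:-1]
        words.append(word)
    return " ".join(words)


# Both spellings from the example above reduce to the same key.
print(normalize("Yaourts fraise") == normalize("Yaourt aux fraises"))
```

A rule this simple also merges words that should stay distinct, which is one way unwanted synonyms can appear.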
Note: if the taxonomy is big and/or contains many synonyms
Compilation script
Testing
stephane@ks3095298:~/product-opener/scripts$ ./build_tags_taxonomy.pl categories
This will compile the categories taxonomy but generate only categories.result.txt from categories.txt.
You should review the resulting text file to make sure unwanted synonyms or relations were not created.
Empty names
An empty name " : " indicates an error in the taxonomy source file; usually a non-existent entry was given as the parent.
< :
< en:Wheat flours
en:Wheat malt flours
es:Harinas de trigo malteadas, Harinas de malta de trigo
fr:Farines de malt de blé
Check the source file to see which entry is wrong:
<en:Wheat flours
<fr:Malt flours
en:Wheat malt flours
es:Harinas de trigo malteadas, Harinas de malta de trigo
fr:Farines de malt de blé
Here "fr:Malt flours" does not exist (it should be "en:Malt flours").
Note: the build_tags_taxonomy.pl script has been updated to report those errors.
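For a quick check outside the build script, the same kind of error can be spotted with a short standalone Python sketch. It is not how build_tags_taxonomy.pl reports errors. It assumes that entries are separated by blank lines, that parent lines start with "<", and that other lines have the form lang:name, synonym. The find_missing_parents() helper and the two parent entries in SAMPLE are added for the example.

```python
import re


def find_missing_parents(source_text):
    """Return (entry, parent) pairs where the parent is not defined by any entry."""
    entries = []  # one (parents, names) pair per entry
    # Assumption: entries are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", source_text.strip()):
        parents, names = [], []
        for line in block.splitlines():
            line = line.strip()
            if not line:
                continue
            if line.startswith("<"):
                parents.append(line[1:].strip())
            else:
                lang, _, values = line.partition(":")
                names.extend(
                    f"{lang}:{value.strip()}"
                    for value in values.split(",")
                    if value.strip()
                )
        if names:
            entries.append((parents, names))

    known = {name.lower() for _, names in entries for name in names}
    return [
        (names[0], parent)
        for parents, names in entries
        for parent in parents
        if parent.lower() not in known
    ]


SAMPLE = """\
en:Wheat flours

en:Malt flours

<en:Wheat flours
<fr:Malt flours
en:Wheat malt flours
es:Harinas de trigo malteadas, Harinas de malta de trigo
fr:Farines de malt de blé
"""

for entry, parent in find_missing_parents(SAMPLE):
    print(f"{entry}: parent '{parent}' is not defined")
```

The comparison is on the full language-prefixed name, so "fr:Malt flours" is reported even though "en:Malt flours" exists.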
Publishing
stephane@ks3095298:~/product-opener/scripts$ ./build_tags_taxonomy.pl categories publish
This will also store categories.result.sto
Test
Taxonomies are very powerful tools, but with great power come great risks. The synonyms, stopwords, etc. can create unwanted synonyms.
Testing tools
We need testing tools. Please create some. :-)
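As one possible starting point, here is a minimal Python sketch of a check that flags synonyms claimed by more than one canonical tag. It is not an existing Product Opener tool. The find_collisions() helper and the sample data are hypothetical, and the sketch assumes the generated synonyms have already been extracted into a dictionary keyed by canonical tag.

```python
from collections import defaultdict


def find_collisions(synonyms_by_tag):
    """Return {synonym: [tags]} for every synonym attached to more than one tag."""
    owners = defaultdict(set)
    for tag, synonyms in synonyms_by_tag.items():
        for synonym in synonyms:
            owners[synonym.strip().lower()].add(tag)
    return {
        synonym: sorted(tags)
        for synonym, tags in owners.items()
        if len(tags) > 1
    }


# Hypothetical data: "orange" ended up as a synonym of two different tags.
SYNONYMS = {
    "en:oranges": ["orange", "oranges"],
    "en:orange-juices": ["jus orange", "orange"],
    "en:strawberry-yogurts": ["yaourt fraise"],
}

for synonym, tags in find_collisions(SYNONYMS).items():
    print(f"'{synonym}' matches several tags: {', '.join(tags)}")
```

Running such a check before publishing would catch one class of unwanted synonyms; it does not replace reviewing the result file.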
Restart Apache
After restarting Apache, editing a product will recompute all tags using the newest published taxonomies.
Updating products
The scripts directory contains scripts that will recompute tags for all products in the database.
stephane@ks3095298:~/product-opener/scripts$ ./update_all_products_categories.pl