How to build and deploy taxonomies

From Open Food Facts wiki

Revision as of 12:06, 12 June 2020

Introduction

Open Food Facts uses global taxonomies (multilingual hierarchies) for categories, labels, ingredients, additives and many other product facets.

Taxonomy files

Source

The taxonomies are defined in text files (e.g. labels.txt), which are kept in the /taxonomies directory on GitHub.

Build

When the source text file of a taxonomy is updated, it needs to be built into a structured representation, stored in Perl binary .sto files (e.g. labels.result.sto).

Built taxonomies are also stored on GitHub, but they may not be up-to-date if the source file has changed and the taxonomy has not been rebuilt.
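
A quick way to spot a possibly stale build is to compare file modification times. The helper below is only a sketch: needs_rebuild is a hypothetical name, not part of the Open Food Facts scripts, and modification times are only a heuristic, because a fresh git checkout sets them to the checkout time rather than the commit time.

```shell
# Succeeds (exit status 0) when the built file is missing or older than its source.
# needs_rebuild is a hypothetical helper, not an existing Open Food Facts script.
needs_rebuild() {
    # $1: taxonomy source file (e.g. labels.txt)
    # $2: built taxonomy file (e.g. labels.result.sto)
    [ ! -e "$2" ] || [ "$1" -nt "$2" ]
}

# Example, assuming the source and built files sit in the same directory:
#   if needs_rebuild labels.txt labels.result.sto; then
#       echo "labels taxonomy should be rebuilt"
#   fi
```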

JSON export

The built taxonomy is also exported as a JSON structure (e.g. labels.json).

Building taxonomies

The build_tags_taxonomy.pl script compiles a taxonomy source file into a built taxonomy.

cd scripts
export PERL5LIB=.
./build_tags_taxonomy.pl labels publish

The built taxonomy files are stored in the taxonomies directory, alongside the source files.

Deploying taxonomies

Stop and start the web site backend to reload taxonomies

The Open Food Facts web site backend (Apache + mod_perl) needs to be stopped and started for the new taxonomies to be loaded.

Update products with the new taxonomies

Recompute facets

Taxonomies correspond to product facets that are stored in MongoDB.

For example, the labels field of a product is parsed against the labels taxonomy to populate the labels_tags field, which is an array of canonical entries like "en:organic".

To recompute the facets corresponding to the taxonomy, we need to update all products.

This script updates all products in MongoDB and on the file server. It must be run as the off user, otherwise products won't be editable afterwards.

sudo su off
cd scripts
export PERL5LIB=.
nice ./update_all_products.pl --fields labels --key labels-20200611

The --key value is used to tag updated products, so that if the script is killed, we don't have to go through every product again.
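
The key in the example follows a name-date pattern (labels-20200611). A small helper can generate such keys consistently. This is a sketch only: make_update_key is a hypothetical name, and the name-YYYYMMDD pattern is a convention taken from the example, not something the script is known to require.

```shell
# Print a run key such as "labels-20200611".
# make_update_key is a hypothetical helper; the pattern mirrors the example key.
make_update_key() {
    # $1: taxonomy name
    # $2: optional date stamp in YYYYMMDD form (defaults to today's date)
    printf '%s-%s\n' "$1" "${2:-$(date +%Y%m%d)}"
}

# Example (as the off user, from the scripts directory, with PERL5LIB=. exported):
#   nice ./update_all_products.pl --fields labels --key "$(make_update_key labels)"
```

When resuming a run that was killed, pass the original date stamp explicitly (e.g. make_update_key labels 20200611), since the default changes from one day to the next and products are tagged with the key of the run that updated them.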

Ingredients analysis reprocessing

Some taxonomies are used for ingredients processing: ingredients.txt, ingredients_processing.txt, additives.txt, labels.txt, vitamins.txt, minerals.txt.

To re-process the ingredients analysis:

This script updates all products in MongoDB and on the file server. It must be run as the off user, otherwise products won't be editable afterwards.

sudo su off
cd scripts
export PERL5LIB=.
nice ./update_all_products.pl --process-ingredients --key labels-20200611

Here too, the --key value is used to tag updated products, so that if the script is killed, we don't have to go through every product again.

Note: in production, with 1.5 million products, it can take multiple days to re-process all products.

Incorporating translations made from the web

Translations are stored in files on the server

off1:/srv/off/translate# ls -lrt

total 164
-rw-r--r-- 1 off  off   4100 Mar 26 19:42 ingredients.nl.txt
drwxr-xr-x 2 root root  4096 Mar 29 10:20 applied.20190329
-rw-r--r-- 1 off  off    619 Mar 29 16:17 ingredients.de.txt
-rw-r--r-- 1 off  off   9472 Apr  1 18:21 ingredients.fr.txt
drwxr-xr-x 2 root root  4096 Apr  9 19:16 applied.20190409
-rw-r--r-- 1 off  off    659 Apr  9 22:12 labels.hu.txt
-rw-r--r-- 1 off  off    924 Apr 11 08:54 nova_groups.ca.txt
-rw-r--r-- 1 off  off   6699 Apr 19 00:27 categories.hu.txt
-rw-r--r-- 1 off  off    176 Apr 19 17:10 categories.zh.txt
-rw-r--r-- 1 off  off    710 Apr 19 17:25 labels.zh.txt
-rw-r--r-- 1 off  off   2392 Apr 21 23:22 labels.pl.txt
-rw-r--r-- 1 off  off   9864 Apr 24 11:52 categories.ca.txt
-rw-r--r-- 1 off  off  13479 Apr 24 13:40 labels.ca.txt
-rw-r--r-- 1 off  off   2141 Apr 25 10:22 labels.he.txt
-rw-r--r-- 1 off  off   1008 May  3 13:06 categories.pl.txt
-rw-r--r-- 1 off  off    616 May  5 22:35 categories.it.txt
-rw-r--r-- 1 off  off   8493 May  8 21:34 categories.de.txt
-rw-r--r-- 1 off  off   2488 May  9 16:09 categories.fr.txt
-rw-r--r-- 1 off  off   2213 May  9 16:13 labels.de.txt
-rw-r--r-- 1 off  off    955 May  9 16:15 labels.fr.txt
-rw-r--r-- 1 off  off  34333 May  9 18:52 categories.nl.txt

Steps

Try this on the test server first.

Add the translations

  • /srv/off/scripts# ./add_users_translations_to_taxonomy.pl categories > /home/off/openfoodfacts-server/taxonomies/categories.txt
  • Review the diffs: git diff (there should be mostly additions)
  • Commit and push
  • Move the applied translations to a new folder
  • /srv/off/translate# mkdir applied.20190513
  • /srv/off/translate# mv categories.* applied.20190513/
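
The last two steps (creating a dated folder and moving the applied files into it) could be wrapped in a small function. This is a sketch under assumptions: archive_applied_translations is a hypothetical name, and it only matches files named <taxonomy>.<language>.txt, which is narrower than the categories.* pattern used above.

```shell
# Move the translation files of one taxonomy into a dated applied.YYYYMMDD folder.
# archive_applied_translations is a hypothetical helper, not an existing script.
archive_applied_translations() {
    # $1: translate directory (e.g. /srv/off/translate)
    # $2: taxonomy name (e.g. categories)
    # $3: optional date stamp in YYYYMMDD form (defaults to today's date)
    archive_dir="$1/applied.${3:-$(date +%Y%m%d)}"
    mkdir -p "$archive_dir" || return 1
    for translation_file in "$1/$2".*.txt; do
        # If nothing matches, the pattern stays literal: skip it.
        [ -f "$translation_file" ] || continue
        mv "$translation_file" "$archive_dir/" || return 1
    done
    return 0
}

# Example, equivalent to the two commands above:
#   archive_applied_translations /srv/off/translate categories 20190513
```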

Build the taxonomy

  • as root (sudo su)
  • cp -a /home/off/openfoodfacts-server/taxonomies/categories.txt /srv/off/taxonomies/
  • export PERL5LIB=.
  • /srv/off/scripts# ./build_tags_taxonomy.pl categories publish
  • Wait. Some taxonomies like categories can take 30 minutes to build.

Stop and start Apache (not a restart)

  • systemctl stop apache2@off
  • systemctl start apache2@off
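
The stop and start can be chained so that the start only runs once the stop has succeeded. The function below is a sketch: reload_taxonomies_backend and the SYSTEMCTL override are hypothetical names introduced here for illustration. Only the two systemctl commands come from the steps above.

```shell
# Stop, then start, the apache2@off unit (a full stop and start, not a restart).
# reload_taxonomies_backend and SYSTEMCTL are hypothetical names for this sketch.
reload_taxonomies_backend() {
    # SYSTEMCTL can be set to "echo" to preview the commands without running them.
    systemctl_cmd=${SYSTEMCTL:-systemctl}
    "$systemctl_cmd" stop apache2@off || return 1
    "$systemctl_cmd" start apache2@off
}

# Example (as root):
#   reload_taxonomies_backend
# Preview only:
#   SYSTEMCTL=echo reload_taxonomies_backend
```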

Apply the new taxonomy to existing products