Reusing Open Food Facts Data: Difference between revisions

Add MongoDB dump section
m (Typo)
(Add MongoDB dump section)
Line 13: Line 13:
The whole database can be downloaded at https://world.openfoodfacts.org/data
The whole database can be downloaded at https://world.openfoodfacts.org/data


It's very big. Open Food Facts hosts more than 1,400,000 products (as of July 2020). So you will probably need skills to reuse the data.
It's very big. Open Food Facts hosts more than 2,200,000 products (as of July 2020). So you will probably need skills to reuse the data.


You'll be able to find here different kinds of data.
You'll be able to find here different kinds of data.


==== The MongoDB daily export ====
==== The MongoDB daily export ====
It represents the most complete data; it's very big and you have to know how to deal with MongoDB. It's very big! More than 9GB uncompressed.
It represents the most complete data; it's very big and you have to know how to deal with MongoDB. It's very big! More than 30GB uncompressed.


==== The JSONL daily export ====
==== The JSONL daily export ====
Line 134: Line 134:
  $ zcat openfoodfacts-products.jsonl.gz | jq -r '. | select(.code|test("^[0-9]{1,13}$") | not) | .code' > ean_gt_13.csv
  $ zcat openfoodfacts-products.jsonl.gz | jq -r '. | select(.code|test("^[0-9]{1,13}$") | not) | .code' > ean_gt_13.csv
These operations can be quite long (more than 10 minutes depending on your computer and your selection).
These operations can be quite long (more than 10 minutes depending on your computer and your selection).
=== MongoDB dump ===
The MongoDB dump can only be used with MongoDB. It lets you build a full replica of the Open Food Facts database and use MongoDB to select, filter and export data. This is faster than the other methods.
First, you '''need a running MongoDB installation'''. Open Food Facts uses MongoDB 4.4. Earlier versions have been reported not to work with the Open Food Facts dump.
See [https://gist.github.com/CharlesNepote/13198c2ed336fc64cb674d63876e8d99 this quick tutorial on how to install MongoDB on Debian 10 or Debian 11].
==== Import Open Food Facts MongoDB dump into MongoDB ====
<pre>
# Download and decompress the dump
wget https://static.openfoodfacts.org/data/openfoodfacts-mongodbdump.tar.gz
tar -xzf openfoodfacts-mongodbdump.tar.gz
# Restore the whole database. mongorestore recreates the indexes recorded by mongodump.
mongorestore --drop ./dump
# => 2254885 document(s) restored successfully. 0 document(s) failed to restore.
</pre>
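Before running the restore, it can help to check that the extracted dump directory and the <code>mongorestore</code> tool are actually present. The following is a minimal sketch, not part of the official procedure: the function name <code>restore_off_dump</code> is hypothetical, and it assumes the archive was extracted to <code>./dump</code> as in the commands above.

```shell
# Hypothetical helper: restore the dump only when the prerequisites are met.
# Usage: restore_off_dump ./dump
restore_off_dump() {
    dump_dir="${1:-./dump}"

    # The archive is assumed to have been extracted already
    if [ ! -d "$dump_dir" ]; then
        echo "dump directory not found: $dump_dir" >&2
        return 1
    fi

    # mongorestore ships with the MongoDB database tools
    if ! command -v mongorestore >/dev/null 2>&1; then
        echo "mongorestore is not installed" >&2
        return 2
    fi

    # Same command as above: drop existing collections, then restore
    mongorestore --drop "$dump_dir"
}
```

The function returns 1 when the directory is missing and 2 when the tool is missing, so a calling script can tell the two cases apart.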
==== Play with the database ====
<pre>
# Display the first 5 products in JSON format. For pagination techniques, see:
# https://www.codementor.io/@arpitbhayani/fast-and-efficient-pagination-in-mongodb-9095flbqr
mongo off --eval 'db.products.find().limit(5).pretty().shellPrint()' --quiet
# Pipe the output through jq (a JSON tool) to get colors
# jq has to be installed separately. See https://stedolan.github.io/jq/
mongo off --eval 'db.products.find().limit(5).pretty().shellPrint()' --quiet | jq .
# Pipe the output through jq to get colors and compact output (one JSON object per line, aka JSONL format)
mongo off --eval 'db.products.find().limit(5).pretty().shellPrint()' --quiet | jq . -c
# Get products from Germany; return fields "code" and "countries_tags"; limit to 2 products
mongo off --eval 'db.products.find({countries_tags: "en:germany"}, {code: 1, countries_tags: 1}).limit(2).pretty().shellPrint()' --quiet
# Get the data from a single field, without _id
mongo off --eval 'db.products.find({countries_tags: "en:germany"}, {_id: 0, countries_tags: 1}).limit(2).pretty().shellPrint()' --quiet
</pre>
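The legacy <code>mongo</code> shell used above was removed in MongoDB 6.0 in favour of <code>mongosh</code>. On such installations, a query along the following lines may serve as a starting point. It is only a sketch: it assumes <code>mongosh</code> is installed and that the dump was restored into the <code>off</code> database. <code>EJSON.stringify</code> is used so that the output is plain JSON that can be piped to jq.

```shell
# Sketch for installations that ship mongosh instead of the legacy mongo shell.
# Assumption: the dump was restored into the "off" database.
QUERY='EJSON.stringify(db.products.find({countries_tags: "en:germany"}, {_id: 0, code: 1, countries_tags: 1}).limit(2).toArray())'

if command -v mongosh >/dev/null 2>&1; then
    # --eval prints the value of the expression; --quiet hides the startup banner
    mongosh off --quiet --eval "$QUERY" || echo "query failed (is MongoDB running?)" >&2
else
    echo "mongosh not found; skipping query" >&2
fi
```

The result is a single JSON array, so it can be piped to <code>jq .</code> in the same way as the commands above.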
==== Export the database ====
<pre>
# Exports
# See: https://www.mongodb.com/docs/database-tools/mongoexport/
# 1. The "aggregate" way
mongo off --eval 'db.products.aggregate([{$match: {product_name: "Coke"}},{$out: "result"}])'
mongoexport --db off --collection result --fields code,product_name --type=csv --out result.csv
# 2. The -q, --query option way
# Export the first 5 German products
mongoexport -d off -c products --type=csv --fields code,countries_tags -q '{"countries_tags": "en:germany"}' --out report.csv --limit 5
# Export to STDOUT in CSV format; notice the --quiet option
mongoexport -d off -c products --type=csv --fields code,countries_tags -q '{"countries_tags": "en:germany"}' --limit 5 --quiet
# How long does it take to export all German products?
time mongoexport -d off -c products --type=csv --fields code,countries_tags -q '{"countries_tags": "en:germany"}' --out report.csv
# real 0m10.135s
# Specify the fields in a file containing the line-separated list of fields to export (--fieldFile option)
# The official CSV export fields come from the @export_fields variable in /lib/ProductOpener/Config_off.pm
mongoexport -d off -c products --type=csv --fieldFile official_csv_export_fields.txt -q '{"countries_tags": "en:germany"}' --limit 5 --quiet
</pre>
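The <code>--fieldFile</code> option expects a plain text file with one field name per line. As an illustration, the sketch below writes such a file with three fields that appear elsewhere on this page. The file name <code>my_export_fields.txt</code> is made up for the example, and this short list is not the official field list.

```shell
# Write one field name per line (hypothetical, shortened field list)
printf '%s\n' code product_name countries_tags > my_export_fields.txt

# Show the number of lines in the file, i.e. the number of fields
wc -l < my_export_fields.txt

# Use the file for an export if mongoexport is available
if command -v mongoexport >/dev/null 2>&1; then
    mongoexport -d off -c products --type=csv --fieldFile my_export_fields.txt \
        -q '{"countries_tags": "en:germany"}' --limit 5 --quiet \
        || echo "export failed (is MongoDB running?)" >&2
else
    echo "mongoexport not found; skipping export" >&2
fi
```

With <code>--type=csv</code>, the columns of the export follow the order of the lines in the field file.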