m (Typo) |
(Add MongoDB dump section) |
||
Line 13: | Line 13: | ||
The whole database can be downloaded at https://world.openfoodfacts.org/data | The whole database can be downloaded at https://world.openfoodfacts.org/data | ||
It's very big. Open Food Facts hosts more than | It's very big. Open Food Facts hosts more than 2,200,000 products (as of July 2020). So you will probably need skills to reuse the data. | ||
You'll be able to find here different kinds of data. | You'll be able to find here different kinds of data. | ||
==== The MongoDB daily export ==== | ==== The MongoDB daily export ==== | ||
It represents the most complete data; it's very big and you have to know how to deal with MongoDB. It's very big! More than | It represents the most complete data; it's very big and you have to know how to deal with MongoDB. It's very big! More than 30GB uncompressed. | ||
==== The JSONL daily export ==== | ==== The JSONL daily export ==== | ||
Line 134: | Line 134: | ||
$ zcat openfoodfacts-products.jsonl.gz | jq -r '. | select(.code|test("^[0-9]{1,13}$") | not) | .code' > ean_gt_13.csv | $ zcat openfoodfacts-products.jsonl.gz | jq -r '. | select(.code|test("^[0-9]{1,13}$") | not) | .code' > ean_gt_13.csv | ||
These operations can be quite long (more than 10 minutes depending on your computer and your selection). | These operations can be quite long (more than 10 minutes depending on your computer and your selection). | ||
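To preview what the last filter keeps, the same regular expression can be tried on a few codes with <code>grep</code>. This is only a small sketch: the four codes below are made up for illustration, and <code>grep -Ev</code> stands in for the <code>test(...) | not</code> part of the jq filter.

```shell
# Keep only the codes that are NOT made of 1 to 13 digits
# (same regular expression as in the jq filter above).
# The sample codes are invented for this illustration.
invalid=$(printf '%s\n' 12345678 1234567890123 12345678901234 ABC123 \
  | grep -Ev '^[0-9]{1,13}$')
echo "$invalid"
```

The 8-digit and 13-digit codes are dropped; the 14-digit code and the non-numeric one are kept.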
=== MongoDB dump ===
The MongoDB dump has to be used with MongoDB. It lets you build a full replica of the Open Food Facts database and use MongoDB to select, filter and export data. Using MongoDB allows faster manipulations than the other methods.
First, you '''need a running MongoDB installation'''. Open Food Facts uses MongoDB 4.4. It has been reported that earlier versions do not work with the Open Food Facts dump.
See [https://gist.github.com/CharlesNepote/13198c2ed336fc64cb674d63876e8d99 this quick tutorial on how to install MongoDB on Debian 10 or Debian 11].
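Before restoring the dump, it can help to check the installed version. The following is a minimal sketch, not an official check: the helper <code>version_at_least</code> is a hypothetical name, it assumes a <code>sort</code> that supports <code>-V</code> (GNU coreutils), and it assumes <code>mongod --version</code> prints a line starting with <code>db version v</code>.

```shell
# Hypothetical helper: succeeds when version $1 is at least version $2.
# Relies on "sort -V" (version sort), which is an assumption about your system.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

required=4.4
if command -v mongod >/dev/null 2>&1; then
  # Assumes the first line looks like "db version v4.4.6"
  installed=$(mongod --version | sed -n 's/^db version v//p')
  if version_at_least "$installed" "$required"; then
    echo "MongoDB $installed is recent enough"
  else
    echo "MongoDB $installed is older than $required" >&2
  fi
else
  echo "mongod not found in PATH" >&2
fi
```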
==== Import Open Food Facts MongoDB dump into MongoDB ====
<pre>
# Download and decompress the dump
wget https://static.openfoodfacts.org/data/openfoodfacts-mongodbdump.tar.gz
tar -xzf openfoodfacts-mongodbdump.tar.gz
# Restore the whole database. mongorestore recreates the indexes recorded by mongodump.
mongorestore --drop ./dump
# => 2254885 document(s) restored successfully. 0 document(s) failed to restore.
</pre>
==== Play with the database ====
<pre>
# Display the first 5 products in JSON format, using pagination
# https://www.codementor.io/@arpitbhayani/fast-and-efficient-pagination-in-mongodb-9095flbqr
mongo off --eval 'db.products.find().limit(5).pretty().shellPrint()' --quiet
# Combined with jq (JSON tool) to provide colors
# jq has to be installed separately. See https://stedolan.github.io/jq/
mongo off --eval 'db.products.find().limit(5).pretty().shellPrint()' --quiet | jq .
# Combined with jq to provide colors and compact output (each JSON object on a single line, aka JSONL format)
mongo off --eval 'db.products.find().limit(5).pretty().shellPrint()' --quiet | jq . -c
# Get products from Germany; return fields "code" and "countries_tags"; limit to 2 products
mongo off --eval 'db.products.find({countries_tags: "en:germany"}, {code: 1, countries_tags: 1}).limit(2).pretty().shellPrint()' --quiet
# Get the data from one field, without _id
mongo off --eval 'db.products.find({countries_tags: "en:germany"}, {_id: 0, countries_tags: 1}).limit(2).pretty().shellPrint()' --quiet
</pre>
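The queries above differ only by the country tag and the limit, so the expression passed to <code>--eval</code> can be built from shell variables. This is a rough sketch: <code>build_find_expr</code> is a hypothetical helper name, <code>en:france</code> is an assumed tag following the same pattern as <code>en:germany</code>, and the arguments are not validated, so only pass values you trust.

```shell
# Hypothetical helper: print a find() expression for a given country tag and limit.
# No input validation is done; do not pass untrusted values.
build_find_expr() {
  printf 'db.products.find({countries_tags: "%s"}, {_id: 0, code: 1, countries_tags: 1}).limit(%d).pretty().shellPrint()' "$1" "$2"
}

find_expr=$(build_find_expr "en:france" 3)
echo "$find_expr"
# With a restored "off" database, the expression can then be run with:
# mongo off --eval "$find_expr" --quiet
```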
==== Export the database ====
<pre>
# Exports
# See: https://www.mongodb.com/docs/database-tools/mongoexport/
# 1. The "aggregate" way: store the matching products in a "result" collection, then export it
mongo off --eval 'db.products.aggregate([{$match: {product_name: "Coke"}},{$out: "result"}])'
mongoexport --db off --collection result --fields code,product_name --type=csv --out result.csv
# 2. The -q,--query option way
# Export the first 5 German products
mongoexport -d off -c products --type=csv --fields code,countries_tags -q '{"countries_tags": "en:germany"}' --out report.csv --limit 5
# Export to STDOUT in CSV format; notice the --quiet option
mongoexport -d off -c products --type=csv --fields code,countries_tags -q '{"countries_tags": "en:germany"}' --limit 5 --quiet
# How long does it take to export all German products?
time mongoexport -d off -c products --type=csv --fields code,countries_tags -q '{"countries_tags": "en:germany"}' --out report.csv
# real 0m10.135s
# Specify the fields in a file containing the line-separated list of fields to export (--fieldFile option)
# Official CSV export fields come from the @export_fields variable in /lib/ProductOpener/Config_off.pm
mongoexport -d off -c products --type=csv --fieldFile official_csv_export_fields.txt -q '{"countries_tags": "en:germany"}' --limit 5 --quiet
</pre>
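The file given to <code>--fieldFile</code> is a plain text file with one field name per line. As a small sketch, here is how such a file could be created; the file name <code>my_export_fields.txt</code> is hypothetical, and the three fields are only the ones already used on this page, not the official export list.

```shell
# Hypothetical file name; one field name per line, as --fieldFile expects.
fields_file=my_export_fields.txt
printf '%s\n' code product_name countries_tags > "$fields_file"
cat "$fields_file"
# With a restored "off" database, the file can then be used like this:
# mongoexport -d off -c products --type=csv --fieldFile "$fields_file" -q '{"countries_tags": "en:germany"}' --limit 5 --quiet
```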