Reusing Open Food Facts Data

From Open Food Facts wiki
[[Category:Reuse]]
Open Food Facts data is released as Open Data: it can be reused freely by anyone, under the Open Database License (ODBL). While this page focuses on practical reuse, you must be aware of the [[ODBL License|rights and duties provided by the Open Database License]] (ODBL).


== Where is the data? ==
You'll find different ways to get the data.


=== Looking for a selection of products? ===
Then use '''the advanced search'''. The Open Food Facts advanced search feature allows you to download selections of the data. See: https://world.openfoodfacts.org/cgi/search.pl


When your search is done, you will be able to download the selection in '''CSV or Excel format''', just give it a try!

Note that you can only download up to 10,000 results. If you need more results, you have to use one of the other methods described below.
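The selection download can also be scripted. The sketch below only builds the request URL; the parameter names (<code>action</code>, <code>search_terms</code>, <code>download</code>, <code>format</code>) are assumptions based on the search form and should be verified on search.pl before relying on them.

```shell
# Build a hypothetical advanced-search download URL (parameter names are
# assumptions; verify them on the search.pl form before use).
BASE="https://world.openfoodfacts.org/cgi/search.pl"
PARAMS="action=process&search_terms=nutella&download=on&format=csv"
URL="${BASE}?${PARAMS}"
echo "$URL"
# curl -sL "$URL" -o selection.csv   # uncomment to actually fetch the selection
```

Keeping the URL construction separate from the fetch makes it easy to check the query in a browser first.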


=== Looking for the whole database? ===
The whole database can be downloaded at https://world.openfoodfacts.org/data


It's very big. Open Food Facts hosts more than 2,800,000 products (as of April 2023), so you will probably need some technical skills to reuse the data.


You'll find several different kinds of data there.


==== The MongoDB daily export ====
It is the most complete data, but you have to know how to deal with MongoDB. It's very big: more than 30GB uncompressed.


==== The JSONL daily export ====
There is a daily export of the whole database in [https://jsonlines.org/ JSONL format] (sometimes called LDJSON or NDJSON), where each line is a JSON object. It represents the same data as the MongoDB export. The file is 4.8GB (2022-10), compressed with gzip, and takes more than 14GB uncompressed.


You can find it at https://static.openfoodfacts.org/data/openfoodfacts-products.jsonl.gz
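You can get a first look at the export without decompressing the whole file, by streaming it with <code>zcat</code> and stopping after the first line. A minimal sketch on a tiny stand-in file (replace it with openfoodfacts-products.jsonl.gz):

```shell
# Create a 2-line stand-in for openfoodfacts-products.jsonl.gz
printf '{"code":"1"}\n{"code":"2"}\n' | gzip > products.jsonl.gz
# Stream-decompress and stop after the first product (no full decompression)
zcat products.jsonl.gz | head -n 1
```

On the real file, the same pipeline shows you the available fields in seconds.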


==== The CSV daily export ====
It contains all the products, but with a subset of the database fields. [https://world.openfoodfacts.org/data/data-fields.txt This subset is very large] and includes the main characteristics (EAN, name, brand...), many tags (such as categories, origins, labels, packaging...), ingredients and nutrition facts. Thus, it is generally suited to the majority of usages. It's a 2.3GB file (as of April 2020), so it can't be opened by LibreOffice or Excel on an 8GB machine.
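If you only need a quick look, one workaround is to keep the header plus the first N rows so that office tools can open the result. A sketch on a tiny stand-in file (use en.openfoodfacts.org.products.csv in practice):

```shell
# Tiny stand-in for en.openfoodfacts.org.products.csv (tab-separated)
printf 'code\tname\n1\ta\n2\tb\n3\tc\n' > export.tsv
# Keep the header plus the first 2 data rows
head -n 3 export.tsv > sample.tsv
wc -l < sample.tsv
```

Note that a head sample is biased towards the oldest products; it is only meant for exploring the file's structure.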


== How to reuse? ==
=== CSV daily export ===
==== csvkit tips ====
[https://csvkit.readthedocs.io/en/latest/ csvkit] is a very efficient tool to manipulate huge amounts of CSV data. Here are some useful tips to manipulate the Open Food Facts CSV export.

''Converting the whole Open Food Facts "CSV" export to regular CSV''. The Open Food Facts export uses tabs as separators: it should be called TSV (tab-separated values) instead of CSV (comma-separated values). <code>csvkit</code> can convert a TSV file into CSV very easily:
<code>$ csvclean -t en.openfoodfacts.org.products.csv > myCSV.csv</code>
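Before converting, it can help to confirm the file really is tab-separated and to count its columns, with plain coreutils. A sketch on a stand-in file (the real export is 2.3GB):

```shell
# Stand-in for the real export header
printf 'code\tproduct_name\tbrands\n3017620422003\tNutella\tFerrero\n' > export.tsv
# Split the header on tabs and count the columns
head -n 1 export.tsv | tr '\t' '\n' | wc -l
```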


''Selecting 2 columns''. Selecting two or three columns can be useful for some usages. Extracting two columns produces a smaller CSV file which can be opened by common software such as LibreOffice or Excel. The following command creates a CSV file (brands.csv) containing two columns from Open Food Facts (code and brands). (It generally takes more than 2 minutes, depending on your computer.)
<code>$ csvcut -t -c code,brands en.openfoodfacts.org.products.csv > brands.csv</code>

''Selecting products based on a regular expression''. csvkit can search in specified fields, allowing powerful selections. The following command creates a CSV file (selection.csv) containing all products whose barcode (code) begins with 325798 (-r "^325798(.*)").
<code>$ csvgrep -t -c code -r "^325798(.*)" en.openfoodfacts.org.products.csv > selection.csv</code>

The following command creates a CSV file (calissons.csv) containing all products whose category (categories) contains "calisson".
<code>$ csvgrep -t -c categories -r "calisson" en.openfoodfacts.org.products.csv > calissons.csv</code>
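The two-column extraction can also be done with plain <code>cut</code>, whose default delimiter is precisely the tab. A sketch on a stand-in file (note that cut, unlike csvkit, does not handle quoted fields containing tabs):

```shell
# Stand-in for the tab-separated export
printf 'code\tproduct_name\tbrands\n001\tFoo\tAcme\n' > export.tsv
# Keep columns 1 (code) and 3 (brands)
cut -f1,3 export.tsv > brands.tsv
cat brands.tsv
```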
==== Import CSV in PostgreSQL ====
See this article: https://blog-postgresql.verite.pro/2018/12/21/import-openfoodfacts.html (in French, but should be understandable with Google Translate).
Alternatively, feel free to use this project from GitHub: https://github.com/ArchiMageAlex/off_converter


==== Import CSV to SQLite ====


The repository [https://github.com/fairdirect/foodrescue-content foodrescue-content] contains Ruby scripts that import Open Food Facts CSV data into a [https://www.sqlite.org/index.html SQLite] database with full table normalization. Only a few fields are imported so far, but this can be extended easily. Data imported so far includes:


* barcode number
* product name
* product categories
* product countries
* full categories hierarchy imported from the <code>categories.txt</code> taxonomy ([https://github.com/openfoodfacts/openfoodfacts-server/tree/master/taxonomies see])
==== Import CSV to DuckDB ====
[https://duckdb.org/ DuckDB] is very close to SQLite, except that it performs better: the database is about 3 times smaller, and queries run 5 to 10 times faster.
<pre>
# Discard invalid characters
# duckdb doesn't like invalid UTF-8. It refused to read a Parquet file built from the raw CSV, with the following error:
# Error: near line 1: Invalid Input Error: Invalid string encoding found in Parquet file: value "........."
# (occurring namely on this product: https://world.openfoodfacts.org/product/9900109008673?rev=4 )
# The issue, and its solution below, seem to be well known: https://til.simonwillison.net/linux/iconv
iconv -f utf-8 -t utf-8 -c en.openfoodfacts.org.products.csv -o en.openfoodfacts.org.products.converted.csv

# Create the DuckDB database and import the data
duckdb products.db <<EOF
CREATE TABLE products AS
    SELECT * FROM read_csv_auto('en.openfoodfacts.org.products.converted.csv', quote='', sample_size=3000000, delim='\t');
EOF

# Then you can try an SQL query
duckdb products.db -csv <<EOF
SELECT * FROM products
  WHERE completeness > 0.99 -- products with a good level of completeness
  ORDER BY last_modified_datetime LIMIT 10;
EOF
</pre>
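To see what the iconv step actually does: with <code>-c</code>, bytes that are not valid UTF-8 are silently dropped instead of aborting the conversion. A minimal demonstration:

```shell
# Write a file containing an invalid UTF-8 byte (\200 = 0x80)
printf 'ok\200line\n' > dirty.csv
# -c drops invalid sequences instead of failing
iconv -f utf-8 -t utf-8 -c dirty.csv -o clean.csv
cat clean.csv
```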
We wrote a detailed article about using DuckDB with the Open Food Facts database. [https://blog.openfoodfacts.org/en/news/food-transparency-in-the-palm-of-your-hand-explore-the-largest-open-food-database-using-duckdb-%f0%9f%a6%86x%f0%9f%8d%8a Check it out]!


==== Python ====
There are some articles dealing with using the Python language to explore Open Food Facts data.

Step-by-step commands: http://www.xavierdupre.fr/app/ensae_teaching_cs/helpsphinx/notebooks/prepare_data_2017.html (also in French)

Python notebooks are great to learn Open Food Facts data, as they mix code and results together.
==== R stat ====
For people who have R stat skills, there are [https://www.kaggle.com/openfoodfacts/world-food-facts/kernels?sortBy=hotness&group=everyone&pageSize=20&datasetId=20&language=R more than 50 notebooks from the Kaggle community].
Moreover, here is a link to transform a .bson file into a dataframe: https://github.com/gnaweric/openfoodfact_database_queries

Using the {mongolite} package, first connect to the database, then import the .bson file, then get a sample of it to make sure it is ready. Finally, save it to an .rdata file, for example.

Beware: each line is a product, and some variables need to be unnested with tidyr::unnest_wider().
=== JSONL delta exports ===
Every day, Open Food Facts exports all the products created during the last 24 hours. The documentation of this export can be found on the /data page.

If you don't have MongoDB and just want to use these delta exports to build an up-to-date database, you can merge each export with the help of the <code>[https://stedolan.github.io/jq/manual/v1.6/ jq]</code> tool.
$ gunzip products_1638076899_1638162314.json.gz # will decompress the file
$ wc -l products_1638076899_1638162314.json # will count the number of products in this export (in JSONL each line is a JSON object)
$ jq -c '. + .' 2021-11-30.json products_1638162314_1638248379.json > 2021-12-01.json # merge the delta with previous complete data
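Since each JSONL line is a self-contained product keyed by its code, a last-wins merge can also be sketched with awk and tac alone (no jq): concatenate the previous file and the delta, then keep only the last occurrence of each code. File names and contents below are illustrative stand-ins, not the real export.

```shell
# Stand-ins for the previous full export and a delta export
printf '{"code":"1","v":1}\n{"code":"2","v":1}\n' > full.jsonl
printf '{"code":"2","v":2}\n{"code":"3","v":1}\n' > delta.jsonl
# Reverse, keep the first (i.e. most recent) line per code, reverse back
cat full.jsonl delta.jsonl \
  | tac \
  | awk 'match($0, /"code":"[^"]*"/) { k = substr($0, RSTART, RLENGTH); if (!seen[k]++) print }' \
  | tac > merged.jsonl
cat merged.jsonl
```

This relies on the "code" field appearing verbatim on each line; jq remains the robust option for anything fancier.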


=== JSONL export ===
The JSONL export is a huge file! It's not possible to play with it with common editors or common tools, but some command-line tools allow interesting things, like [https://duckdb.org/ DuckDB] or [https://stedolan.github.io/jq/manual/v1.6/ jq].
 
==== DuckDB ====
DuckDB recognizes the compressed JSONL file, so there is no need to uncompress it. You can start playing with the JSONL export with very few commands.

First, create a DuckDB file:
  $ duckdb off.db
Then you can start playing with SQL commands.
  SELECT code, product_name, countries_en FROM read_ndjson('openfoodfacts-products.jsonl.gz', ignore_errors=True)
  WHERE countries_en LIKE '%Germany%' LIMIT 10;
The advantages of DuckDB over the other solutions are that it is straightforward and very fast.
 
If you want to go further, you can check the [https://blog.openfoodfacts.org/en/news/food-transparency-in-the-palm-of-your-hand-explore-the-largest-open-food-database-using-duckdb-%f0%9f%a6%86x%f0%9f%8d%8a blog article from Jeremy Arancio].


==== jq ====
* Start by decompressing the file (be careful: 14GB after decompression):
  $ gunzip openfoodfacts-products.jsonl.gz
* Work on a small subset to test. E.g. for 100 products:
  $ head -n 100 openfoodfacts-products.jsonl > small.jsonl

You can start playing with jq. Here are some examples.
  $ cat small.jsonl | jq . # print the whole file in JSON format
  $ cat small.jsonl | jq -r .code # print all products' codes
  $ cat small.jsonl | jq -r '[.code,.product_name] | @csv' # output CSV data containing code,product_name

Then you can try on the whole database:
  $ cat openfoodfacts-products.jsonl | jq -r '[.code,.product_name] | @csv' > names.csv # output CSV file (names.csv) containing all products with code,product_name


If you don't have enough disk space to uncompress the .gz file, you can use zcat directly on the compressed file. Example:
  $ zcat openfoodfacts-products.jsonl.gz | jq -r '[.code,.product_name] | @csv' # output CSV data containing code,product_name
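If jq is not installed at all, a rough sed extraction of the top-level code field is possible, because code is a plain string field on every line. This is a fragile sketch (it ignores escaping and nested objects), not a jq replacement:

```shell
# One-line stand-in for the export
printf '{"code":"3017620422003","product_name":"Nutella"}\n' > small.jsonl
# Capture the digits of the "code" field on each line
sed -n 's/.*"code":"\([0-9]*\)".*/\1/p' small.jsonl
```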
==== Filtering JSONL export with jq ====
Filtering a specific country:
$ zcat openfoodfacts-products.jsonl.gz | jq '. | select(.countries_tags[]? == "en:germany")'
The previous command produces JSON output containing all the products sold in Germany. If you want JSONL output, add the -c parameter.
$ zcat openfoodfacts-products.jsonl.gz | jq -c '. | select(.countries_tags[]? == "en:germany")'
You can add multiple filters and export the result to a CSV file. For example, here is a command that 1. selects products having the Nutri-Score computed and belonging to the TOP 90% most scanned products in 2020, and 2. exports barcode (<code>code</code>) and number of scans (<code>scans_n</code>) as a CSV file.
$ zcat openfoodfacts-products.jsonl.gz | jq -r '. | select(.misc_tags[]? == "en:nutriscore-computed" and .popularity_tags[]? == "top-90-percent-scans-2020") | [.code,.scans_n] | @csv' > displayed.ns.in.top90.2020.world.csv
Filtering barcodes that do not match a code of 1 to 13 digits:
$ zcat openfoodfacts-products.jsonl.gz | jq -r '. | select(.code|test("^[0-9]{1,13}$") | not) | .code' > ean_gt_13.csv
Some parts of the data are arrays; you must aggregate them using <code>join</code> for CSV export. For example, to export each product and its states in CSV:
$ zcat openfoodfacts-products.jsonl.gz | jq -r '[.code,(.states_tags|join(","))] | @csv'
Selecting products with quality issues and exporting the barcode and the issues in CSV:
$ zcat openfoodfacts-products.jsonl.gz | jq -r '. | select(.data_quality_errors_tags[]? != "")' | jq -r '[.code,(.data_quality_errors_tags|join(","))] | @csv'
These operations can be quite long (more than 10 minutes depending on your computer and your selection).
=== MongoDB dump ===
The [https://world.openfoodfacts.org/data MongoDB dump] is meant to be reused with MongoDB. It allows building a full replica of the Open Food Facts database and using MongoDB for selecting, filtering and exporting data. MongoDB allows faster manipulations compared to the other methods.
First, you '''need a running MongoDB installation'''. Open Food Facts is using MongoDB 4.4. It has been reported that prior versions do not work with the Open Food Facts dump.
You can find [https://gist.github.com/CharlesNepote/13198c2ed336fc64cb674d63876e8d99 here a quick tutorial on how to install MongoDB on Debian 10 or Debian 11].
==== Import Open Food Facts MongoDB dump into MongoDB ====
<pre>
# Download
wget https://static.openfoodfacts.org/data/openfoodfacts-mongodbdump.gz
# Import (no need to uncompress)
# --nsFrom=off.products          # source of the database / collection
# --nsTo=mydatabase.mycollection  # local database / collection
# --drop                          # Before restoring the collections from the dumped backup, drops the collections from the target database.
#                                # --drop does not drop collections that are not in the backup.
mongorestore -vvvvv  --gzip --archive="./openfoodfacts-mongodbdump.gz" --nsFrom=off.products --nsTo=mydatabase.mycollection --drop
</pre>
==== (Deprecated) Import Open Food Facts MongoDB dump into MongoDB ====
<pre>
# Download and decompress the dump
wget https://static.openfoodfacts.org/data/openfoodfacts-mongodbdump.tar.gz
tar -xzf openfoodfacts-mongodbdump.tar.gz
# Restore all the database. mongorestore recreates indexes recorded by mongodump.
mongorestore --drop ./dump
# => 2254885 document(s) restored successfully. 0 document(s) failed to restore.
# On newer versions of mongorestore, you can use `--gzip --archive=openfoodfacts-mongodbdump.tar.gz`
</pre>
==== Play with the database ====
<pre>
# Display the first 5 products in JSON format, using pagination
# https://www.codementor.io/@arpitbhayani/fast-and-efficient-pagination-in-mongodb-9095flbqr
mongo off --eval 'db.products.find().limit(5).pretty().shellPrint()' --quiet
# Combined with JQ (JSON tool) to provide colors
# jq has to be installed separately. See https://stedolan.github.io/jq/
mongo off --eval 'db.products.find().limit(5).pretty().shellPrint()' --quiet | jq .
# Combined with JQ (JSON tool) to provide colors and compact output (each JSON object on a single line (aka JSONL format))
mongo off --eval 'db.products.find().limit(5).pretty().shellPrint()' --quiet | jq . -c
# Get products from Germany; return fields "code" and "countries_tags"; limit to 2 products
mongo off --eval 'db.products.find({countries_tags: "en:germany"}, {code: 1, countries_tags: 1}).limit(2).pretty().shellPrint()' --quiet
# get the data from one field without _id
mongo off --eval 'db.products.find({countries_tags: "en:germany"}, {_id: 0, countries_tags: 1}).limit(2).pretty().shellPrint()' --quiet
</pre>
==== Export the database ====
<pre>
# Exports
# See: https://www.mongodb.com/docs/database-tools/mongoexport/
# 1. The "aggregate" way
mongo off --eval 'db.products.aggregate([{$match: {product_name: "Coke"}},{$out: "result"}])'
mongoexport --db off --collection result --fields code,product_name --type=csv --out result.csv
# 2. the -q,--query option way
# Export the first 5 German products
mongoexport -d off -c products --type=csv --fields code,countries_tags -q '{"countries_tags": "en:germany"}' --out report.csv --limit 5
# Export to STDOUT in CSV format; notice the --quiet option
mongoexport -d off -c products --type=csv --fields code,countries_tags -q '{"countries_tags": "en:germany"}' --limit 5 --quiet
# How long to export all German products?
time mongoexport -d off -c products --type=csv --fields code,countries_tags -q '{"countries_tags": "en:germany"}' --out report.csv
# real 0m10.135s
# Specify the fields in a file containing the line-separated list of fields to export (--fieldFile option)
# Official csv export fields are coming from @export_fields variable in /lib/ProductOpener/Config_off.pm
mongoexport -d off -c products --type=csv --fieldFile official_csv_export_fields.txt -q '{"countries_tags": "en:germany"}' --limit 5 --quiet
</pre>
==== List all fields used in the database ====
<pre>
# Open Food Facts database contains hundreds of fields.
# An easy way to list them all is to use "variety" Schema Analyzer:
# https://github.com/variety/variety
# 1. Install "variety"
git clone https://github.com/variety/variety.git
# 2. Use it
cd ./variety
# Analyzing can be very long (hours). You can restrict the analysis to a limited number of products
time mongo off --eval "var collection = 'products', limit = 1000" variety.js > off_schema_1000.txt
# (17 s)
time mongo off --eval "var collection = 'products', limit = 10000" variety.js > off_schema_10000.txt
# (3 minutes)
time mongo off --eval "var collection = 'products', limit = 100000" variety.js > off_schema_100000.txt
# (75 minutes)
time mongo off --eval "var collection = 'products'" variety.js > off_schema_all.txt
# (more than two days)
</pre>
=== CSV export via SQL (beta) ===
We are testing a new kind of tool to provide the data: every day, an SQL database is fed by the regular daily CSV export and published online thanks to the Datasette tool.
The tool, called ''[[Mirabelle]]'', can be found here: http://mirabelle.openfoodfacts.org/
The whole CSV export can be found here: http://mirabelle.openfoodfacts.org/products/all
* The tool supports simple queries with a form, and also facet navigation.
* For those who know SQL language, it allows rich and complex queries.
What's different from the [https://world.openfoodfacts.org/cgi/search.pl Open Food Facts advanced search]?
* It's possible to export selections with more than 10,000 products (e.g. big queries by country).
* It's possible to build queries by date.
* It allows richer queries with OR, AND, NOT, REGEXP, etc.
* It is possible to restrict the number of fields displayed and exported.
* It is possible to order results by any field.
==== Example ====
'''1 -- Build your query (or ask someone to build it for you)'''
E.g. all German products that have been scanned at least once.
-- Products from Germany that have been scanned at least one time
select code, product_name from [all]
where countries_en like "%germany%" and unique_scans_n is not null
order by unique_scans_n desc
-- the limit here displays 20 results; the link "CSV without limit" below allows you to download all the data without limit
limit 20
https://mirabelle.openfoodfacts.org/products?sql=--+Products+from+Germany+that+have+been+scanned+at+least+one+time%0D%0Aselect+code%2C+product_name+from+%5Ball%5D%0D%0Awhere+countries_en+like+%22%25germany%25%22+and+unique_scans_n+is+not+null%0D%0Aorder+by+unique_scans_n+desc%0D%0A--+the+limit+here+displays+20+results%3B+the+link+%22CSV+without+limit%22+below+allows+to+download+all+the+data+without+limit%0D%0Alimit+20
'''2 -- Click on the link "CSV without limit"'''
Maybe you have to wait several seconds. It will download a product.csv file.
==== Tips ====
* Several fields -- such as <code>countries_en</code>, <code>categories_en</code>, etc. -- contain multiple values. To query a particular value, you have to use the <code>like</code> operator with percent signs, like this: <code>like '%italy%'</code>.

Latest revision as of 09:01, 12 July 2024

Open Food Facts data is released as Open Data: it can be reused freely by anyone, under the Open Database License (ODBL). While this page is related to practical reuse, you must really be aware of rights and duties provided by the Open Database License (ODBL).

Where is the data?

You'll find different kind of ways to get the data.

Looking for a selection of products?

Then use the advanced search. The Open Food Facts advanced search feature allows to download selections of the data. See: https://world.openfoodfacts.org/cgi/search.pl

When your search is done, you will be able to download the selection in CSV or Excel format, just give a try!

Do notice you can download up to 10,000 results only. If you need to download more results you have to use some other methods described below.

Looking for the whole database?

The whole database can be downloaded at https://world.openfoodfacts.org/data

It's very big. Open Food Facts hosts more than 2,800,000 products (as of April 2023). So you will probably need skills to reuse the data.

You'll be able to find here different kinds of data.

The MongoDB daily export

It represents the most complete data; it's very big and you have to know how to deal with MongoDB. It's very big! More than 30GB uncompressed.

The JSONL daily export

There is a daily export of the whole database in JSONL format (sometimes called LDJSON or NDJSON) where each line is a JSON object. It represents the same data as the MongoDB export. The file is 4,8GB (2022-10), compressed with gzip. It takes more than 14GB uncompressed.

You can find it at https://static.openfoodfacts.org/data/openfoodfacts-products.jsonl.gz

The CSV daily export

It contains all the products, but with a subset of the database fields. This subset is very large and include main characteristics (EAN, name, brand...), many tags (such as categories, origins, labels, packaging...), ingredients and nutrition facts. Thus, it is generally fitted to the majority of usages. It's a 2.3GB file (as of April 2020), so it can't be opened by Libre Office or Excel with an 8GB machine.

How to reuse?

CSV daily export

csvkit tips

csvkit is a very efficient tool to manipulate huge amounts of CSV data. Here are some useful tips to manipulate Open Food Facts CSV export.

Converting whole Open Food Facts "CSV" export to regular CSV. Open Food Facts export use tabs as separator: it should be called TSV (tab separated values) instead of CSV (comma separated values). csvkit can convert TSV file into CSV very easily:

$ csvclean -t en.openfoodfacts.org.products.csv > myCSV.csv

Selecting 2 columns. Selecting two or three columns can be useful for some usages. Extracting two columns produce a smaller CSV file which can be opened by common softwares such as Libre Office or Excel. The following command creates a CSV file (brands.csv) containing two columns from Open Food Facts (code and brands). (It generally takes more than 2 minutes, depending on your computer.)

$ csvcut -t -c code,brands en.openfoodfacts.org.products.csv > brands.csv

Selecting products based on a regular expression. csvkit can search in some specified fields, allowing to make powerful selections. The following command creates a CSV file (selection.csv) containing all products where the barcode (code) is beginning by 325798 (-r "^325798(.*)").

$ csvgrep -t -c code -r "^325798(.*)" en.openfoodfacts.org.products.csv > selection.csv

The following command creates a CSV file (calissons.csv) containing all products where the category (categories) is containing "calisson".

$ csvgrep -t -c categories -r "calisson" en.openfoodfacts.org.products.csv > calisson.csv

Import CSV in PostgreSQL

See this article: https://blog-postgresql.verite.pro/2018/12/21/import-openfoodfacts.html (in french, but should be understandable with Google Translator).

Alternative way - feel free to use a project from github: https://github.com/ArchiMageAlex/off_converter

Import CSV to SQLite

The repository foodrescue-content contains Ruby scripts that import Open Food Facts CSV data into a SQLite database with full table normalization. Only a few fields are imported so far, but this can be extended easily. Data imported so far includes:

  • barcode number
  • product name
  • product categories
  • product countries
  • full categories hierarchy imported from the categories.txt taxonomy (see)

Import CSV to DuckDB

DuckDB is very close to SQLite, except it has higher performances: database size is 3 times lighter, and requests performs 5-10 times better.

# Discard invalid characters
# duckdb doesn't like invalid UTF8. It did not want to read some parquet file as such, with the following error:
# Error: near line 1: Invalid Input Error: Invalid string encoding found in Parquet file: value "........."
# (occuring namely on this product: https://world.openfoodfacts.org/product/9900109008673?rev=4 )
# The issue, and its solution below, seems to be well-known: https://til.simonwillison.net/linux/iconv
iconv -f utf-8 -t utf-8 -c en.openfoodfacts.org.products.csv -o en.openfoodfacts.org.products.converted.csv

# Create duckdb database and import data
duckdb products.db <<EOF
CREATE TABLE products AS
    SELECT * FROM read_csv_auto('en.openfoodfacts.org.products.converted.csv', quote='', sample_size=3000000, delim='\t');
EOF

# Then you can try a SQL request
duckdb products.db -csv <<EOF
SELECT * FROM products
  WHERE completeness > 0.99 -- products with a good level of completeness
  ORDER BY last_modified_datetime LIMIT 10;
EOF

We wrote a detailled article about using DuckDB with the Open Food Facts database. Check it out!

Python

There are some articles dealing with using Python language to explore Open Food Facts data.

Step by step commands: http://www.xavierdupre.fr/app/ensae_teaching_cs/helpsphinx/notebooks/prepare_data_2017.html (also in french)

Python notebooks are great to learn Open Food Facts data, as they mix code and results together:

R stat

For people who have R stat skills, there are more than 50 notebooks from Kaggle community.

Moreover, here a link to transform .bson file to a dataframe: https://github.com/gnaweric/openfoodfact_database_queries

With the use of {mongolite}, first connect to the base, then import the .bson file, then get a sample of it to make sure it is ready. Finally save it to a .rdata file for example.

Beware, each line is a product and some variable need to be unnest: tidyverser::unnest_wider()

JSONL delta exports

Every day, Open Food Facts exports all the products created during the last 24 hours. The documentation of this export can be found in the /data page.

If you don't have MongoDB and just want to use these delta exports to build an up-to-date database, you can merge each export with the help of jq tool.

$ gunzip products_1638076899_1638162314.json.gz # will decompress the file
$ wc -l products_1638076899_1638162314.json # will count the number of products in this export (in JSONL each line is a JSON object)
$ jq -c '. + .' 2021-11-30.json products_1638162314_1638248379.json > 2021-12-01.json # merge the delta with previous complete data

JSONL export

JSONL is a huge file! It's not possible to play with it with common editors or common tools. There is some command line tools that allows interesting things, like DuckDB or jq.

DuckDB

DuckDB recognizes the compressed JSONL file, there is no need to uncompress it. You can start playing with JSONL export with very few commands.

At first create a DuckDB file:

$ duckdb off.db

Then you can start playing with SQL commands.

SELECT code, product_name, countries_en FROM read_ndjson('openfoodfacts-products.jsonl.gz', ignore_errors=True) 
WHERE countries_en LIKE "%Germany%" LIMIT 10;

The advantages of DuckDB over the other solutions are that it is straightfoward and very fast.

If you want to go further, you can check the blog article from Jeremy Arancio.

jq

  • start decompress the file (be careful => 14GB after decompression):
$ gunzip openfoodfacts-products.jsonl.gz
  • work on a small subset to test. E.g. for 100 products:
$ head -n 100 openfoodfacts-products.jsonl > small.jsonl

You can start playing with jq. Here are examples.

$ cat small.jsonl | jq . # print all file in JSON format
$ cat small.jsonl | jq -r .code # print all products' codes.
$ cat small.jsonl | jq -r '[.code,.product_name] | @csv' # output CSV data containing code,product_name

Then you can try on the whole database:

$ cat openfoodfacts-products.jsonl | jq -r '[.code,.product_name] | @csv' > names.csv # output CSV file (name.csv) containing all products with code,product_name

If you don't have enough disk space to uncompress the .gz file, you can use zcat directly on the compressed file. Example:

$ zcat openfoodfacts-products.jsonl.gz | jq -r '[.code,.product_name] | @csv' # output CSV data containing code,product_name

Filtering JSONL export with jq

Filtering a specific country:

$ zcat openfoodfacts-products.jsonl.gz | jq '. | select(.countries_tags[]? == "en:germany")'

The previous command produces a json output containing all the products sold in Germany. If you want a JSONL output, add -c parameter.

$ zcat openfoodfacts-products.jsonl.gz | jq -c '. | select(.countries_tags[]? == "en:germany")'

You can add multiple filters and export the result to a CSV file. For example, here is a command that 1. selects products having the Nutri-Score computed and belonging to the TOP 90% most scanned products in 2020, and 2. exports barcode (code) and number of scans (scans_n) as a CSV file.

$ zcat openfoodfacts-products.jsonl.gz | jq -r '. | select(.misc_tags[]? == "en:nutriscore-computed" and .popularity_tags[]? == "top-90-percent-scans-2020") | [.code,.scans_n] | @csv' > displayed.ns.in.top90.2020.world.csv

Filtering barcodes which are different from a code containing 1 to 13 digits:

$ zcat openfoodfacts-products.jsonl.gz | jq -r '. | select(.code|test("^[0-9]{1,13}$") | not) | .code' > ean_gt_13.csv

Some part of the data are arrays, you must aggregate them using join for CSV export. For example, to export each product and its states in CSV:

$ zcat openfoodfacts-products.jsonl.gz | jq -r '[.code,(.states_tags|join(","))] | @csv'
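The same join-then-CSV idea in Python, as a sketch on one invented product record. Note that the csv module automatically quotes the joined field because it contains commas, just as jq's @csv does:

```python
import csv, io, json

# Invented sample record with an array field
product = json.loads('{"code":"123","states_tags":["en:to-be-completed","en:photos-uploaded"]}')

out = io.StringIO()
# join the array into one cell; csv.writer quotes it since it contains commas
csv.writer(out).writerow([product["code"], ",".join(product["states_tags"])])
print(out.getvalue().strip())  # → 123,"en:to-be-completed,en:photos-uploaded"
```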

Selecting products with quality issues and exporting the barcode and the issues in CSV:

$ zcat openfoodfacts-products.jsonl.gz | jq -r 'select(.data_quality_errors_tags[]? != "") | [.code,(.data_quality_errors_tags|join(","))] | @csv'

These operations can take quite a long time (more than 10 minutes, depending on your computer and your selection).

MongoDB dump

The MongoDB dump is meant to be used with MongoDB. It lets you build a full replica of the Open Food Facts database and use MongoDB for selecting, filtering and exporting data. MongoDB allows faster manipulations than the other methods.

First, you need a running MongoDB installation. Open Food Facts uses MongoDB 4.4; it has been reported that earlier versions may not work with the Open Food Facts dump.

You can see here a quick tutorial on how to install MongoDB on Debian 10 or Debian 11.

Import Open Food Facts MongoDB dump into MongoDB

# Download
wget https://static.openfoodfacts.org/data/openfoodfacts-mongodbdump.gz

# Import (no need to uncompress)
# --nsFrom=off.products           # source of the database / collection
# --nsTo=mydatabase.mycollection  # local database / collection
# --drop                          # Before restoring the collections from the dumped backup, drops the collections from the target database.
#                                 # --drop does not drop collections that are not in the backup.
mongorestore -vvvvv  --gzip --archive="./openfoodfacts-mongodbdump.gz" --nsFrom=off.products --nsTo=mydatabase.mycollection --drop

(Deprecated) Import Open Food Facts MongoDB dump into MongoDB

# Download and decompress the dump
wget https://static.openfoodfacts.org/data/openfoodfacts-mongodbdump.tar.gz
tar -xzf openfoodfacts-mongodbdump.tar.gz

# Restore all the database. mongorestore recreates indexes recorded by mongodump.
mongorestore --drop ./dump
# => 2254885 document(s) restored successfully. 0 document(s) failed to restore.

#On newer versions of mongorestore, you can use `--gzip --archive=openfoodfacts-mongodbdump.tar.gz`

Play with the database

# Display the first 5 products in JSON format, using pagination
# https://www.codementor.io/@arpitbhayani/fast-and-efficient-pagination-in-mongodb-9095flbqr
mongo off --eval 'db.products.find().limit(5).pretty().shellPrint()' --quiet

# Combined with JQ (JSON tool) to provide colors
# jq has to be installed separately. See https://stedolan.github.io/jq/
mongo off --eval 'db.products.find().limit(5).pretty().shellPrint()' --quiet | jq .

# Combined with JQ (JSON tool) to provide colors and compact output: each JSON object on a single line (aka the JSONL format)
mongo off --eval 'db.products.find().limit(5).pretty().shellPrint()' --quiet | jq . -c

# Get products from Germany; return fields "code" and "countries_tags"; limit to 2 products
mongo off --eval 'db.products.find({countries_tags: "en:germany"}, {code: 1, countries_tags: 1}).limit(2).pretty().shellPrint()' --quiet

# Get the data from one field, without _id
mongo off --eval 'db.products.find({countries_tags: "en:germany"}, {_id: 0, countries_tags: 1}).limit(2).pretty().shellPrint()' --quiet

Export the database

# Exports
# See: https://www.mongodb.com/docs/database-tools/mongoexport/


# 1. The "aggregate" way
mongo off --eval 'db.products.aggregate([{$match: {product_name: "Coke"}},{$out: "result"}])'

mongoexport --db off --collection result --fields code,product_name --type=csv --out result.csv


# 2. the -q,--query option way

# Export the first 5 German products
mongoexport -d off -c products --type=csv --fields code,countries_tags -q '{"countries_tags": "en:germany"}' --out report.csv --limit 5

# Export to STDOUT in CSV format; notice the --quiet option
mongoexport -d off -c products --type=csv --fields code,countries_tags -q '{"countries_tags": "en:germany"}' --limit 5 --quiet

# How long to export all German products?
time mongoexport -d off -c products --type=csv --fields code,countries_tags -q '{"countries_tags": "en:germany"}' --out report.csv
# real	0m10.135s

# Specify the fields in a file containing the line-separated list of fields to export (--fieldFile option)
# Official csv export fields are coming from @export_fields variable in /lib/ProductOpener/Config_off.pm
mongoexport -d off -c products --type=csv --fieldFile official_csv_export_fields.txt -q '{"countries_tags": "en:germany"}' --limit 5 --quiet

List all fields used in the database

# Open Food Facts database contains hundreds of fields.

# An easy way to list them all is to use "variety" Schema Analyzer:
# https://github.com/variety/variety

# 1. Install "variety"
git clone https://github.com/variety/variety.git

# 2. Use it
cd ./variety

# Analyzing can be very long (hours). You can restrict the analysis to a limited number of products:
time mongo off --eval "var collection = 'products', limit = 1000" variety.js > off_schema_1000.txt
# (17 s)

time mongo off --eval "var collection = 'products', limit = 10000" variety.js > off_schema_10000.txt
# (3 minutes)

time mongo off --eval "var collection = 'products', limit = 100000" variety.js > off_schema_100000.txt
# (75 minutes)

time mongo off --eval "var collection = 'products'" variety.js > off_schema_all.txt
# (more than two days)
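For a quick, much smaller-scale look at the schema, you can also collect the top-level field names yourself from a JSONL sample with plain Python. This is only a sketch on invented lines, not a replacement for variety's full per-field analysis:

```python
import json

# Invented sample JSONL lines standing in for a dump extract
lines = [
    '{"code":"1","product_name":"Water","nutriments":{}}',
    '{"code":"2","countries_tags":["en:italy"]}',
]

# Union of all top-level keys seen across the sample
fields = set()
for line in lines:
    fields.update(json.loads(line).keys())
print(sorted(fields))  # → ['code', 'countries_tags', 'nutriments', 'product_name']
```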


CSV export via SQL (beta)

We are testing a new kind of tool to provide the data: every day, an SQL database is built from the regular daily CSV export and published online thanks to the Datasette tool.

The tool, called Mirabelle, can be found here: http://mirabelle.openfoodfacts.org/

The whole CSV export can be found here: http://mirabelle.openfoodfacts.org/products/all

  • The tool supports simple queries with a form, and also facet navigation.
  • For those who know SQL language, it allows rich and complex queries.

How is it different from the Open Food Facts advanced search?

  • It's possible to export selections with more than 10,000 products (e.g. big queries by country).
  • It's possible to build queries by date.
  • It allows richer queries with OR, AND, NOT, REGEXP, etc.
  • It is possible to restrict the number of fields displayed and exported.
  • It is possible to order results by any field.

Example

1 -- Build your query (or ask someone to build it for you)

E.g. all German products that have been scanned at least once.

-- Products from Germany that have been scanned at least one time
select code, product_name from [all]
where countries_en like "%germany%" and unique_scans_n is not null
order by unique_scans_n desc
-- the limit here displays 20 results; the link "CSV without limit" below allows you to download all the data without limit
limit 20

https://mirabelle.openfoodfacts.org/products?sql=--+Products+from+Germany+that+have+been+scanned+at+least+one+time%0D%0Aselect+code%2C+product_name+from+%5Ball%5D%0D%0Awhere+countries_en+like+%22%25germany%25%22+and+unique_scans_n+is+not+null%0D%0Aorder+by+unique_scans_n+desc%0D%0A--+the+limit+here+displays+20+results%3B+the+link+%22CSV+without+limit%22+below+allows+to+download+all+the+data+without+limit%0D%0Alimit+20

2 -- Click on the link "CSV without limit"

You may have to wait several seconds. A product.csv file will be downloaded.

Tips

  • Several fields -- such as countries_en, categories_en, etc. -- contain multiple values. To query a particular value you have to use the like operator and wrap the value in percent signs, like this: like "%italy%".
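The tip above can be illustrated with SQLite (the engine behind Datasette), using a tiny invented table: because countries_en packs several values into one comma-separated string, an equality test would miss products sold in more than one country, while like "%italy%" matches the value anywhere in the string.

```python
import sqlite3

# Toy table (invented data) mimicking the multi-valued countries_en column
con = sqlite3.connect(":memory:")
con.execute("create table all_products (code text, countries_en text)")
con.executemany(
    "insert into all_products values (?, ?)",
    [("1", "France,Italy"), ("2", "Italy"), ("3", "Germany")],
)

# like '%italy%' finds Italy even inside "France,Italy";
# note that SQLite's LIKE is case-insensitive for ASCII by default
rows = con.execute(
    "select code from all_products where countries_en like '%italy%'"
).fetchall()
print(rows)  # → [('1',), ('2',)]
```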