Reusing Open Food Facts Data

From Open Food Facts wiki
Revision as of 16:47, 25 March 2021 by Charlesnepote (talk | contribs) (Multiple filter and CSV export)

Open Food Facts data is released as Open Data: it can be reused freely by anyone, under the Open Database License (ODBL). While this page is related to practical reuse, you must really be aware of rights and duties provided by the Open Database License (ODBL).

Where is the data?

You'll find different kind of ways to get the data.

Looking for a selection of products?

Then use the advanced search. The Open Food Facts advanced search feature allows to download selections of the data. See: https://world.openfoodfacts.org/cgi/search.pl

When your search is done, you will be able to download the selection in CSV or Excel format, just give a try!

Looking for the whole database?

The whole database can be downloaded at https://world.openfoodfacts.org/data

It's very big. Open Food Facts hosts more than 1,400,000 products (as of July 2020). So you will probably need skills to reuse the data.

You'll be able to find here different kinds of data.

The MongoDB daily export

It represents the most complete data; it's very big and you have to know how to deal with MongoDB. It's very big! More than 9GB uncompressed.

The JSONL daily export

While still undocumented, there is a daily export of the whole database in JSONL format (sometimes called LDJSON or NDJSON) where each line is a JSON object. It represents the same data as the MongoDB export. The file is 2,7GB (2020-09), compressed with gzip. It takes more than 14GB uncompressed.

You can find it at https://static.openfoodfacts.org/data/openfoodfacts-products.jsonl.gz

The CSV daily export

It contains all the products, but with a subset of the database fields. This subset is very large and include main characteristics (EAN, name, brand...), many tags (such as categories, origins, labels, packaging...), ingredients and nutrition facts. Thus, it is generally fitted to the majority of usages. It's a 2.3GB file (as of April 2020), so it can't be opened by Libre Office or Excel with an 8GB machine.

How to reuse?

CSV daily export

csvkit tips

csvkit is a very efficient tool to manipulate huge amounts of CSV data. Here are some useful tips to manipulate Open Food Facts CSV export.

Selecting 2 columns. Selecting two or three columns can be useful for some usages. Extracting two columns produce a smaller CSV file which can be opened by common softwares such as Libre Office or Excel. The following command creates a CSV file (brands.csv) containing two columns from Open Food Facts (code and brands). (It generally takes more than 2 minutes, depending on your computer.)

$ csvcut -t -c code,brands en.openfoodfacts.org.products.csv > brands.csv

Selecting products based on a regular expression. csvkit can search in some specified fields, allowing to make powerful selections. The following command creates a CSV file (selection.csv) containing all products where the barcode (code) is beginning by 325798 (-r "^325798(.*)").

$ csvgrep -t -c code -r "^325798(.*)" en.openfoodfacts.org.products.csv > selection.csv

The following command creates a CSV file (calissons.csv) containing all products where the category (categories) is containing "calisson".

$ csvgrep -t -c categories -r "calisson" en.openfoodfacts.org.products.csv > calisson.csv

Import CSV in PostgreSQL

See this article: https://blog-postgresql.verite.pro/2018/12/21/import-openfoodfacts.html (in french, but should be understandable with Google Translator).

Alternative way - feel free to use a project from github: https://github.com/ArchiMageAlex/off_converter

Import CSV to SQLite

The repository foodrescue-content contains Ruby scripts that import Open Food Facts CSV data into a SQLite database with full table normalization. Only a few fields are imported so far, but this can be extended easily. Data imported so far includes:

  • barcode number
  • product name
  • product categories
  • product countries
  • full categories hierarchy imported from the categories.txt taxonomy (see)

Python

There are some articles dealing with using Python language to explore Open Food Facts data.

Step by step commands: http://www.xavierdupre.fr/app/ensae_teaching_cs/helpsphinx/notebooks/prepare_data_2017.html (also in french)

Python notebooks are great to learn Open Food Facts data, as they mix code and results together:

R stat

For people who have R stat skills, there are more than 50 notebooks from Kaggle community.

Moreover, here a link to transform .bson file to a dataframe: https://github.com/gnaweric/openfoodfact_database_queries

With the use of {mongolite}, first connect to the base, then import the .bson file, then get a sample of it to make sure it is ready. Finally save it to a .rdata file for example.

Beware, each line is a product and some variable need to be unnest: tidyverser::unnest_wider()

JSONL export

JSONL is a huge file! It's not possible to play with it with common editors or common tools. But there is some command line tools that allows interesting things, like jq.

jq

  • start decompress the file (be carreful => 14GB after decompression):
$ gunzip openfoodfacts-products.jsonl.gz
  • work on a small subset to test. E.g. for 100 products:
$ head -n 100 openfoodfacts-products.jsonl > small.jsonl

You can start playing with jq. Here are examples.

$ cat small.jsonl | jq . # print all file in JSON format
$ cat small.jsonl | jq -r .code # print all products' codes.
$ cat small.jsonl | jq -r '[.code,.product_name] | @csv' # output CSV data containing code,product_name

Then you can try on the whole database:

$ cat openfoodfacts-products.jsonl | jq -r '[.code,.product_name] | @csv' > names.csv # output CSV file (name.csv) containing all products with code,product_name

If you don't have enough disk space to uncompress the .gz file, you can use zcat directly on the compressed file. Example:

$ zcat openfoodfacts-products.jsonl.gz | jq -r '[.code,.product_name] | @csv' # output CSV data containing code,product_name

Filtering JSONL export with jq

Filtering a specific country:

$ zcat openfoodfacts-products.jsonl.gz | jq '. | select(.countries_tags[]? == "en:germany")'

The previous command produces a json output containing all the products sold in Germany. If you want a JSONL output, add -c parameter.

$ zcat openfoodfacts-products.jsonl.gz | jq -c '. | select(.countries_tags[]? == "en:germany")'

You can add multiple filters and export the result to a CSV file. For example, here is a command that 1. selects products having the Nutri-Score computed and belonging to the TOP 90% most scanned products in 2020, and 2. exports barcode (code) and number of scans (scans_n) as a CSV file.

$ zcat openfoodfacts-products.jsonl.gz | jq -r '. | select(.misc_tags[]? == "en:nutriscore-computed" and .popularity_tags[]? == "top-90-percent-scans-2020") | [.code,.scans_n] | @csv' > displayed.ns.in.top90.2020.world.csv

These operations can be quite long (more than 10 minutes depending on your computer and your selection).