Packagings data structure
Latest revision as of 13:54, 27 June 2021
Introduction
Open Food Facts has a very open, organic and iterative approach to data: most fields are initially not strictly defined and are often free text fields, and once we have enough data, we analyze it to determine how we can bring more structure to it, so that we can support more uses.
The most common approach is the OFF taxonomy system that brings multilingual entries that can be matched across languages, hierarchical entries, and properties.
Some fields require more complex approaches. For instance, we turn free-text ingredient lists into nested arrays of ingredient and sub-ingredient entries, using dedicated text parsing algorithms supported by several taxonomies (ingredients, ingredient properties, labels, countries, etc.).
Packagings are very similar to ingredients, in the sense that the more structure (and unfortunately very often complexity) we bring, the more uses we can support.
This page proposes a new data structure for packagings, and a phased approach to transition from the current system to the new system.
Current approach
We currently have a free-text "packaging" field that is a "tags" field: values are comma-separated and recorded in the "packaging_tags" field, which can be accessed through the /packaging/ facet.
Issues with the current approach
- Different types of data: shapes of containers, materials, sizes, labels (e.g. FSC), properties (e.g. "to be recycled") etc.
- Varying levels of combination of data types (e.g. sometimes we have "plastic bottle", and sometimes "plastic" and "bottle")
- No hierarchy between entries (e.g. "recycled glass" is not associated with "glass")
- Entries in different languages, with no matching between them
- Flat tags structure makes it very difficult to distinguish individual packaging units and how many there are (e.g. 6-pack of 33cl cans)
New approach (proposal)
Structured array of packagings
To more accurately represent the packagings of a product and to support more uses, we can move from a flat representation (array of tags) to a more structured representation of the different units of packagings of a product, with an array of packaging units, each with different properties:
Number of units | Shape / Container | Materials | Labels | Quantity contained (Volume / Weight) | Size | Weight | Brands | Recycling instructions |
---|---|---|---|---|---|---|---|---|
(number) | (tags) | (tags) | (tags) | (value + unit) | (values + unit) | (value + unit) | (tags) | (tags) |
1 | Box | Cardboard | FSC-Mixed | | 20x35x15 cm | 30 g | | To be recycled |
6 | Bottle | White glass, Glass | | 25 cl | | 220 g | | Reusable |
This new structure can support many more uses than the flat tags. For instance, for an environmental score, we could compute the amount of packaging per 100 g or 100 cl of product: six 25 cl bottles generate more waste than one 1.5 L bottle.
For each product, the array of packaging units would be stored in a new "packagings" array of hashes:
[
  { number=>1, shape=>"en:box", material=>"en:cardboard", labels=>"en:fsc, en:fsc-mixed", size_values=>"20x35x15", size_unit=>"cm", recycling=>"en:recycle" },
  { number=>6, shape=>"en:bottle", material=>"en:white-glass,en:glass", quantity_value=>25, quantity_unit=>"cl", recycling=>"en:reuse" }
]
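As a rough illustration of the kind of computation this structure enables, the sketch below mirrors the two example units in Python and computes the grams of packaging per 100 ml of product. The `weight_value` and `weight_unit` keys are assumptions (the weights appear in the table but not in the example array), and the helper name `packaging_grams_per_100ml` is hypothetical.

```python
# Hypothetical Python mirror of the "packagings" array.
# weight_value / weight_unit are assumed field names, not part of the proposal.
packagings = [
    {"number": 1, "shape": "en:box", "material": "en:cardboard",
     "weight_value": 30, "weight_unit": "g"},
    {"number": 6, "shape": "en:bottle", "material": "en:white-glass,en:glass",
     "quantity_value": 25, "quantity_unit": "cl",
     "weight_value": 220, "weight_unit": "g"},
]

# Conversion factors to millilitres for the volume units handled here.
ML_PER_UNIT = {"ml": 1, "cl": 10, "l": 1000}


def packaging_grams_per_100ml(units):
    """Total packaging weight (g) per 100 ml of product contained."""
    total_weight_g = 0.0
    total_volume_ml = 0.0
    for unit in units:
        count = unit.get("number", 1)
        # This sketch assumes weights are already expressed in grams.
        total_weight_g += count * unit.get("weight_value", 0)
        if "quantity_value" in unit:
            factor = ML_PER_UNIT[unit["quantity_unit"]]
            total_volume_ml += count * unit["quantity_value"] * factor
    if total_volume_ml == 0:
        return None  # no volume information, so the ratio is undefined
    return 100 * total_weight_g / total_volume_ml
```

With the example values this is 1350 g of packaging (30 g + 6 × 220 g) for 1500 ml of product, i.e. 90 g per 100 ml.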
Packagings tags for facet search and discovery
Values would also be combined and stored in a new "packagings_tags" field for a new /packagings/ facet:
en:box, en:cardboard, en:cardboard-fsc, en:cardboard-fsc-mixed, en:cardboard-box, en:cardboard-to-be-recycled etc.
en:glass, en:glass-bottle, en:bottle-25-cl, en:glass-bottle-25-cl etc.
Note: Ideally we could generate all property combinations as individual tag values, to support listing all products with, for instance, "reusable 25 cl glass bottles". We may need to restrict this to some combinations if it causes performance issues due to too many combinations.
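One possible way to derive such combined tags is sketched below. The combination rules (material, shape, material + shape, and the same with quantity appended) are an assumption modelled on the example tags, not a defined specification, and `_strip_lang` is a hypothetical helper.

```python
def _strip_lang(tag):
    """Drop the language prefix: 'en:glass' -> 'glass'."""
    return tag.split(":", 1)[-1].strip()


def generate_packagings_tags(packagings):
    """Build a sorted list of combined tags from an array of packaging units."""
    tags = set()
    for unit in packagings:
        shape = _strip_lang(unit.get("shape", ""))
        materials = [_strip_lang(m)
                     for m in unit.get("material", "").split(",") if m.strip()]
        quantity = ""
        if "quantity_value" in unit and "quantity_unit" in unit:
            quantity = f"{unit['quantity_value']}-{unit['quantity_unit']}"
        if shape:
            tags.add(f"en:{shape}")
            if quantity:
                tags.add(f"en:{shape}-{quantity}")
        for material in materials:
            tags.add(f"en:{material}")
            if shape:
                tags.add(f"en:{material}-{shape}")
                if quantity:
                    tags.add(f"en:{material}-{shape}-{quantity}")
    return sorted(tags)
```

Restricting the set of combinations, if performance requires it, would amount to dropping some of the `tags.add` calls in this sketch.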
Separation of preservation methods
Preservation methods (e.g. "fresh", "frozen", "canned", "dried", "sous-vide", "protective gas" etc.) that are often in the existing packaging field are moved to a separate "preservation_methods" taxonomized tag field.
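A minimal sketch of this separation, assuming the existing tags have already been normalized to canonical IDs. The `PRESERVATION_METHODS` set is a hypothetical stand-in for the taxonomy, and the tag IDs in it are illustrative.

```python
# Hypothetical subset of a preservation methods taxonomy (canonical tag IDs).
PRESERVATION_METHODS = {"en:fresh", "en:frozen", "en:canned", "en:dried"}


def split_preservation_methods(packaging_tags):
    """Return (remaining packaging tags, preservation method tags)."""
    packaging = [t for t in packaging_tags if t not in PRESERVATION_METHODS]
    preservation = [t for t in packaging_tags if t in PRESERVATION_METHODS]
    return packaging, preservation
```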
Migration steps
We need to convert as much of the existing data from the current approach to the new approach before deprecating the current approach.
Phase 1 - Develop supporting architecture for the new packagings and packagings_tags
- Develop an extract_packagings_from_tags function
  - input: current packaging_tags field
  - output: new packagings array
  - components:
    - new packagings containers, materials, labels, recycling instructions taxonomies
      - to be adapted from existing taxonomies + bottom-up approach
- Develop a generate_packagings_tags function
  - input: new packagings array
  - output: new combined packagings properties tags
- Develop a display_packagings_tags function
  - input: new combined packagings properties tags
  - output: localized display values (e.g. en:25-cl-plastic-bottle -> "25 cl plastic bottle" + "bouteille plastique 25cl")
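The display step could be sketched as a translation lookup with a simple fallback. The `TRANSLATIONS` table and the function name `display_packagings_tag` are hypothetical; in practice the localized names would presumably come from the taxonomies.

```python
# Hypothetical translation table; real values would come from taxonomies.
TRANSLATIONS = {
    "en:25-cl-plastic-bottle": {
        "en": "25 cl plastic bottle",
        "fr": "bouteille plastique 25cl",
    },
}


def display_packagings_tag(tag, lang="en"):
    """Return a localized display value for a combined packagings tag."""
    names = TRANSLATIONS.get(tag, {})
    if lang in names:
        return names[lang]
    # Fallback: drop the language prefix and replace hyphens with spaces.
    return tag.split(":", 1)[-1].replace("-", " ")
```

The fallback keeps untranslated tags readable, at the cost of showing the English form in other languages.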
At the end of phase 1:
- The current packaging field read and write UI is still in place
- The current packaging and packaging_tags still exist
- New packagings and packagings_tags are starting to be available for some basic uses (e.g. for the Eco-Score computations)
Phase 2 - Incremental migration of the current packaging to the new packagings
Iterate over the matching algorithms and taxonomy entries so that most of the current data can be converted to the new format.
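To make this iteration measurable, one option is to track which share of existing tags the extraction can convert. The sketch below uses tiny hypothetical lookup tables (`SHAPES`, `MATERIALS`) in place of real taxonomies, and a deliberately naive word-matching extraction; the function names are assumptions as well.

```python
# Hypothetical lookup tables standing in for the shape and material taxonomies.
SHAPES = {"bottle": "en:bottle", "box": "en:box", "bouteille": "en:bottle"}
MATERIALS = {"plastic": "en:plastic", "glass": "en:glass",
             "cardboard": "en:cardboard", "verre": "en:glass"}


def extract_packagings_from_tags(packaging_tags):
    """Convert flat tags into packaging units; also return unmatched tags."""
    packagings, unmatched = [], []
    for tag in packaging_tags:
        # "en:plastic-bottle" -> ["plastic", "bottle"]
        words = tag.split(":", 1)[-1].replace("-", " ").split()
        unit = {}
        for word in words:
            if word in SHAPES:
                unit["shape"] = SHAPES[word]
            elif word in MATERIALS:
                unit["material"] = MATERIALS[word]
        if unit:
            packagings.append(unit)
        else:
            unmatched.append(tag)
    return packagings, unmatched


def conversion_rate(packaging_tags):
    """Share of tags (0.0 to 1.0) that produced at least one property."""
    if not packaging_tags:
        return 0.0
    _, unmatched = extract_packagings_from_tags(packaging_tags)
    return 1 - len(unmatched) / len(packaging_tags)
```

The list of unmatched tags shows which taxonomy entries or matching rules to add in the next iteration. Note that this sketch turns each tag into its own unit, so separate "plastic" and "bottle" tags are not merged; a real implementation would need to handle that.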
Phase 3 - Supplement the population of the new packagings from recycling instructions
Use recycling instructions (sent by the producer or extracted by the OCR) to supplement the population of the new packagings array.
Suggestions to improve parsing
- Remove boilerplate information ("consignedetri.fr")
- Autocompletion would be handy
- Inline edit to save time (inline image selection, and inline text extraction)
- Bad parsing
  - https://world.openfoodfacts.org/product/3270160866786/merlu-blanc-sauce-au-cidre-picard
  - https://world.openfoodfacts.org/product/3560070539260/garniture-3-legumes-carrefour
  - https://world.openfoodfacts.org/product/3092718619008/sirop-de-peche-teisseire
  - https://world.openfoodfacts.org/product/3263850313158/poivre-blanc-leader-price "boîte et couvercle plastique à jetër"
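A minimal sketch of parsing a recycling instruction after stripping boilerplate. The vocabularies, the tag IDs `en:lid` and `en:discard`, and the rule that one material and one instruction apply to every shape in the phrase are all simplifying assumptions made for illustration.

```python
import re
import unicodedata

BOILERPLATE = ["consignedetri.fr"]
# Hypothetical French vocabularies; real ones would come from taxonomies.
SHAPES = {"boîte": "en:box", "couvercle": "en:lid", "bouteille": "en:bottle"}
MATERIALS = {"plastique": "en:plastic", "verre": "en:glass",
             "carton": "en:cardboard"}
RECYCLING = {"à jeter": "en:discard", "à recycler": "en:recycle"}


def parse_recycling_instruction(text):
    """Return one packaging unit per shape word found in the text."""
    # Normalize accents to composed form so dictionary lookups match.
    text = unicodedata.normalize("NFC", text).lower()
    for phrase in BOILERPLATE:
        text = text.replace(phrase, " ")
    material = next((tag for word, tag in MATERIALS.items() if word in text), None)
    recycling = next((tag for phrase, tag in RECYCLING.items() if phrase in text), None)
    units = []
    for word in re.findall(r"\w+", text):
        if word in SHAPES:
            unit = {"shape": SHAPES[word]}
            if material:
                unit["material"] = material
            if recycling:
                unit["recycling"] = recycling
            units.append(unit)
    return units
```

The variant "à jetër" quoted above would not match this exact-phrase lookup, so handling such input would need some tolerance for spelling noise.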
Phase 4 - Develop a data entry UI for the new format (at least for the web site, possibly for mobile)
Phase 5 - Deprecate current packaging field
- Remove the UI to write the current packaging field
- Deploy the UI to write the new packagings array