Ingredients Extraction and Analysis

** Split some (but not all) enumerations of the form "A of B, C and D"
*** e.g. "Huile de palme, colza et tournesol" -> Huile de palme, huile de colza, huile de tournesol
** Handle * and other markers indicating that an ingredient is organic, fair trade, etc.
*** e.g. "Pomme*, ..., *: ingrédient issu de l'agriculture biologique" -> "Pomme bio"
* Current solution
** Perl code and regular expressions
*** lib/ProductOpener/Ingredients.pm - preparse_ingredients_text()
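The two pre-parsing rules above can be illustrated with a minimal Python sketch. This is not the Perl code in preparse_ingredients_text(); the function names, the "huile de" prefix and the French legend pattern are assumptions chosen to reproduce the two examples given in the list.

```python
import re


def split_shared_prefix(text, prefix="huile de"):
    """Expand 'Huile de palme, colza et tournesol' into one item per oil.

    Hypothetical helper: it only handles one known prefix and the French
    conjunction 'et'. A real rule set must decide when NOT to expand
    (e.g. 'huile de palme, sucre et sel'), which this sketch does not do.
    """
    pattern = re.compile(
        rf"\b({re.escape(prefix)})\s+(\w+(?:\s*,\s*\w+)*\s+et\s+\w+)",
        re.IGNORECASE,
    )

    def expand(match):
        items = re.split(r"\s*,\s*|\s+et\s+", match.group(2))
        # Keep the original casing of the first prefix, repeat it for the rest.
        first = f"{match.group(1)} {items[0]}"
        rest = [f"{prefix} {item}" for item in items[1:]]
        return ", ".join([first] + rest)

    return pattern.sub(expand, text)


def apply_organic_marker(text):
    """Turn 'Pomme*' into 'Pomme bio' when a '*: ... biologique' legend exists."""
    legend = re.compile(
        r"[,.;]?\s*\*\s*:?\s*ingrédients?\s+issus?\s+de\s+"
        r"l['’]agriculture\s+biologique\.?",
        re.IGNORECASE,
    )
    if not legend.search(text):
        return text  # no legend: leave any asterisks untouched
    text = legend.sub("", text)  # drop the legend itself
    return re.sub(r"(\w)\*", r"\1 bio", text)  # mark each starred ingredient


print(split_shared_prefix("Huile de palme, colza et tournesol"))
print(apply_organic_marker(
    "Pomme*, sucre, *: ingrédient issu de l'agriculture biologique"
))
```

The sketch applies the legend only when it is actually present, so a stray asterisk in a text without a legend is left alone.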
==== Ingredients parsing ====
* Separate individual ingredients and match them to the ingredients taxonomy
** Extract properties of ingredients
*** Labels such as organic, fair trade, etc.
*** Quantity (%)
*** Processing (e.g. "cooked")
*** Origin (e.g. "France")
** Multi-level ingredients / sub-ingredients
*** e.g. "Fromage (Lait, présure, sel)"
** Recognize whether "A and B" is a single ingredient or two separate ingredients
*** The taxonomy is used to decide: if "A and B" is a known entry, it is kept as one ingredient
* Current solution
** Perl code and regular expressions + multilingual ingredients taxonomy
*** lib/ProductOpener/Ingredients.pm - extract_ingredients_from_text()
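The parsing steps listed above (sub-ingredients, percentages, and the "A and B" decision) can be sketched as follows. This is an illustrative Python sketch, not the Perl code in extract_ingredients_from_text(); the tiny TAXONOMY set, the function names and the output dictionary keys ("text", "known", "percent", "ingredients") are all assumptions.

```python
import re

# Toy stand-in for the multilingual ingredients taxonomy (assumption).
TAXONOMY = {"fromage", "lait", "présure", "sel", "sucre", "sel et poivre"}


def split_top_level(text):
    """Split on commas that are not inside parentheses."""
    parts, depth, current = [], 0, []
    for char in text:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
        if char == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(char)
    parts.append("".join(current).strip())
    return [p for p in parts if p]


def split_conjunction(name, taxonomy=TAXONOMY):
    """Keep 'A et B' whole if the taxonomy knows it, otherwise split it."""
    if name.lower() in taxonomy or " et " not in name:
        return [name]
    return [part.strip() for part in name.split(" et ")]


def parse_ingredients(text, taxonomy=TAXONOMY):
    """Return a nested list of ingredient dicts for an ingredients text."""
    result = []
    for part in split_top_level(text):
        # 'Name (sub, ingredients)' -> name + optional sub-ingredient text.
        match = re.match(r"^(?P<name>[^(]+?)\s*(?:\((?P<sub>.*)\))?$", part)
        name, sub = (match["name"], match["sub"]) if match else (part, None)

        # Extract a quantity such as '20%' or '12,5 %' from the name.
        percent = None
        pm = re.search(r"\s*(\d+(?:[.,]\d+)?)\s*%", name)
        if pm:
            percent = float(pm.group(1).replace(",", "."))
            name = (name[:pm.start()] + name[pm.end():]).strip()

        names = split_conjunction(name, taxonomy)
        for n in names:
            item = {"text": n, "known": n.lower() in taxonomy}
            if len(names) == 1:  # properties only apply to an unsplit item
                if percent is not None:
                    item["percent"] = percent
                if sub:
                    item["ingredients"] = parse_ingredients(sub, taxonomy)
            result.append(item)
    return result


parsed = parse_ingredients(
    "Fromage 20% (Lait, présure, sel), sel et poivre, sucre et cacao"
)
for item in parsed:
    print(item)
```

In this sketch "sel et poivre" stays a single ingredient because the toy taxonomy contains it, while "sucre et cacao" is split into two because it does not.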
== End-to-end metrics ==
* For each product, we have the number of known and unknown ingredients
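As a rough illustration of this metric, the sketch below counts known and unknown ingredients in a nested parse result. The dictionary shape (a "known" flag and an optional "ingredients" list of sub-ingredients) is an assumption made for the example, as is the choice to count sub-ingredients along with top-level ones.

```python
def count_known_unknown(ingredients):
    """Return (known, unknown) counts, including nested sub-ingredients."""
    known = unknown = 0
    for item in ingredients:
        if item.get("known"):
            known += 1
        else:
            unknown += 1
        sub_known, sub_unknown = count_known_unknown(item.get("ingredients", []))
        known += sub_known
        unknown += sub_unknown
    return known, unknown


def known_ratio(ingredients):
    """Share of recognized ingredients, or None for an empty list."""
    known, unknown = count_known_unknown(ingredients)
    total = known + unknown
    return known / total if total else None


# Hypothetical parse result for one product.
product_ingredients = [
    {"text": "Fromage", "known": True, "ingredients": [
        {"text": "Lait", "known": True},
        {"text": "xyz", "known": False},
    ]},
    {"text": "cacao", "known": False},
]
print(count_known_unknown(product_ingredients))
print(known_ratio(product_ingredients))
```

Averaging such a ratio over many products would give one possible end-to-end indicator of parsing quality.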
== Resources ==
=== Data ===
* Ingredients text and result of ingredient parsing in MongoDB JSON / JSONL exports: https://world.openfoodfacts.org/data
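A JSONL export can be read line by line, one product per line. The sketch below assumes that each product object carries the fields "code", "ingredients_text" and "ingredients"; check these names against the actual export before relying on them. The sample lines are invented for the example.

```python
import json


def iter_ingredient_records(lines):
    """Yield the ingredient-related fields of each product in a JSONL stream.

    `lines` is any iterable of strings, e.g. an open file object.
    Field names are assumptions about the export format.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        product = json.loads(line)
        text = product.get("ingredients_text")
        if not text:
            continue  # skip products without an ingredients list
        yield {
            "code": product.get("code"),
            "ingredients_text": text,
            "ingredients": product.get("ingredients", []),
        }


# Invented sample data standing in for a downloaded export file.
sample = [
    '{"code": "0001", "ingredients_text": "Fromage (Lait, sel)"}',
    '{"code": "0002"}',
    "",
]
records = list(iter_ingredient_records(sample))
print(records)
```

With a real file, the same function can be fed an open file handle so that the export is processed as a stream rather than loaded into memory at once.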
=== Ingredients taxonomy ===
* Definition:
* JSON result: