Ingredients Extraction and Analysis: Difference between revisions

1,118 bytes added , 10 September 2019

@@ Line 128: / Line 128: @@
 ** Split some "A of B, C and D" (but not all...)
 *** e.g. "Huile de palme, colza et tournesol" -> "Huile de palme, huile de colza, huile de tournesol"
+** Handle * and other signs that indicate some ingredients are organic, fair trade etc.
+*** e.g. "Pomme*, ..., *: ingrédient issu de l'agriculture biologique" -> "Pomme bio"
 * Current solution
 ** Perl code and regular expressions
 *** lib/ProductOpener/Ingredients.pm - preparse_ingredients_text()
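The two preparsing rules above can be sketched in a few lines. This is an illustrative Python sketch only, not the Perl code in preparse_ingredients_text(): the function names, the regular expressions and the small `OIL_SOURCES` set are all assumptions made for the example, and it only covers the French phrasings quoted in the list.

```python
import re

# Assumption: a trailing footnote of this exact French form defines "*" as organic.
ORGANIC_FOOTNOTE = re.compile(
    r"[,.;]?\s*\*\s*:?\s*ingrédients? issus? de l'agriculture biologique\.?\s*$",
    re.IGNORECASE,
)

# Assumption: a tiny illustrative list of words that can follow "huile de".
OIL_SOURCES = {"palme", "colza", "tournesol", "olive"}

# Matches "huile de X, Y et Z" with one or more comma-separated items before "et".
SHARED_PREFIX = re.compile(r"\b(huile) de (\w+(?:, \w+)*) et (\w+)", re.IGNORECASE)


def expand_organic_marker(text: str) -> str:
    """Turn 'X*' into 'X bio' when a trailing footnote defines * as organic."""
    if not ORGANIC_FOOTNOTE.search(text):
        return text  # no footnote found: leave the asterisks untouched
    text = ORGANIC_FOOTNOTE.sub("", text)
    # An asterisk directly after a word marks that ingredient as organic.
    return re.sub(r"(\w)\s*\*", r"\1 bio", text)


def split_shared_prefix(text: str) -> str:
    """Expand 'huile de A, B et C' into 'huile de A, huile de B, huile de C'."""

    def expand(match: re.Match) -> str:
        head = match.group(1)
        items = match.group(2).split(", ") + [match.group(3)]
        # Only rewrite when every item is a known oil source ("but not all...").
        if not all(item.lower() in OIL_SOURCES for item in items):
            return match.group(0)
        rest = [f"{head.lower()} de {item}" for item in items[1:]]
        return ", ".join([f"{head} de {items[0]}"] + rest)

    return SHARED_PREFIX.sub(expand, text)


print(split_shared_prefix("Huile de palme, colza et tournesol"))
print(expand_organic_marker("Pomme*, sucre, *: ingrédient issu de l'agriculture biologique"))
```

Checking every item against a word list is one way to avoid rewriting lists such as "Huile de palme, sucre et sel", where the trailing items are separate ingredients rather than further oils.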
+==== Ingredients parsing ====
+* Separate individual ingredients and match them to the ingredients taxonomy
+** Extract properties of ingredients
+*** Labels like organic, fair trade etc.
+*** Quantity (%)
+*** Processing (e.g. "cooked")
+*** Origin (e.g. "France")
+** Multi-level ingredients / sub-ingredients
+*** e.g. "Fromage (Lait, présure, sel)"
+** Recognize when "A and B" is a single ingredient, or 2 ingredients
+*** Uses the taxonomy to make the determination
+* Current solution
+** Perl code and regular expressions + multilingual ingredients taxonomy
+*** lib/ProductOpener/Ingredients.pm - extract_ingredients_from_text()
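The parsing steps listed above (sub-ingredients in parentheses, a trailing quantity, and the taxonomy-based "A and B" decision) can be illustrated with a small sketch. It is written in Python for brevity and is not the Perl implementation in extract_ingredients_from_text(). The `TAXONOMY` set is a toy stand-in for the multilingual taxonomy, and the output dictionary keys are assumptions made for the example.

```python
import re

# Assumption: a toy, lowercase stand-in for the real multilingual taxonomy.
TAXONOMY = {
    "fromage", "lait", "présure", "sel", "poivre",
    "mono- et diglycérides d'acides gras",
}

# A quantity such as "20%" or "2,5 %" at the end of an ingredient.
PERCENT = re.compile(r"\s*(\d+(?:[.,]\d+)?)\s*%\s*$")


def split_top_level(text):
    """Split on commas that are not inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return [p for p in parts if p]


def split_and(name, taxonomy):
    """Split 'A et B' only if the whole is unknown and both halves are known."""
    if name.lower() in taxonomy or " et " not in name:
        return [name]
    left, right = name.split(" et ", 1)
    if left.lower() in taxonomy and right.lower() in taxonomy:
        return [left, right]
    return [name]


def parse_ingredients(text, taxonomy):
    result = []
    for part in split_top_level(text):
        sub = []
        open_i, close_i = part.find("("), part.rfind(")")
        if 0 <= open_i < close_i:
            # Recurse into the parenthesised sub-ingredient list.
            sub = parse_ingredients(part[open_i + 1:close_i], taxonomy)
            part = (part[:open_i] + part[close_i + 1:]).strip()
        percent = None
        match = PERCENT.search(part)
        if match:
            percent = float(match.group(1).replace(",", "."))
            part = part[:match.start()].strip()
        # Only try the "A and B" split for plain names without extra properties.
        names = split_and(part, taxonomy) if percent is None and not sub else [part]
        for name in names:
            result.append({
                "text": name,
                "percent": percent,
                "known": name.lower() in taxonomy,
                "ingredients": sub,
            })
    return result


parsed = parse_ingredients(
    "Fromage 20% (Lait, présure, sel), sel et poivre, "
    "mono- et diglycérides d'acides gras",
    TAXONOMY,
)
print([item["text"] for item in parsed])
```

In this sketch "sel et poivre" becomes two ingredients because both halves are in the toy taxonomy, while "mono- et diglycérides d'acides gras" stays whole because the full name is itself a taxonomy entry. Labels, processing and origin are left out to keep the example short.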
+== End-to-end metrics ==
+* For each product, we have the number of known ingredients (matched to the ingredients taxonomy) and the number of unknown ingredients
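As an illustration of this metric, the sketch below counts known and unknown ingredients for one product. It assumes a hypothetical parsed structure (a list of dictionaries with a "text" key and an optional nested "ingredients" list) and a toy taxonomy given as a set of lowercase names; neither is taken from the actual code base.

```python
def count_known_unknown(ingredients, taxonomy):
    """Return (known, unknown) counts, including nested sub-ingredients."""
    known = unknown = 0
    for ingredient in ingredients:
        if ingredient["text"].lower() in taxonomy:
            known += 1
        else:
            unknown += 1
        # Sub-ingredients count towards the same totals.
        sub_known, sub_unknown = count_known_unknown(
            ingredient.get("ingredients", []), taxonomy
        )
        known += sub_known
        unknown += sub_unknown
    return known, unknown


# Hypothetical parsed product and toy taxonomy, for illustration only.
taxonomy = {"fromage", "lait", "sel"}
parsed = [
    {"text": "Fromage", "ingredients": [{"text": "Lait"}, {"text": "arôme mystère"}]},
    {"text": "sel"},
]
known, unknown = count_known_unknown(parsed, taxonomy)
print(f"known={known} unknown={unknown} share={known / (known + unknown):.0%}")
```

Summed over many products, the share of known ingredients gives a single number that can be tracked as the parser or the taxonomy changes.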
+== Resources ==
+=== Data ===
+* Ingredients text and result of ingredient parsing in MongoDB JSON / JSONL exports: https://world.openfoodfacts.org/data
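A JSONL export holds one JSON document per line, so it can be read as a stream without loading the whole file. The sketch below assumes that each product document carries `code` and `ingredients_text` fields and that the export is a gzip-compressed file; the file name in the usage comment is hypothetical and should be checked against the data page.

```python
import json


def iter_ingredient_texts(lines):
    """Yield (code, ingredients_text) for each product line that has a text."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        product = json.loads(line)
        # Assumed field names: "code" and "ingredients_text".
        text = product.get("ingredients_text")
        if text:
            yield product.get("code"), text


# Usage with a downloaded export (file name is an assumption):
#
#   import gzip
#   with gzip.open("products.jsonl.gz", "rt", encoding="utf-8") as handle:
#       for code, text in iter_ingredient_texts(handle):
#           print(code, text)

sample = [
    '{"code": "1", "ingredients_text": "Sucre, sel"}',
    "",
    '{"code": "2"}',
]
print(list(iter_ingredient_texts(sample)))
```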
+=== Ingredients taxonomy ===
+* Definition:
+* JSON result: