Ingredients Extraction and Analysis: Difference between revisions

** Test sets need to be created
*** Run spellcheckers on actual ingredients lists from OFF, review corrections
== Ingredients analysis ==
=== Steps for ingredients analysis ===
==== Ingredients pre-parsing ====
* The ingredients list is transformed to make parsing easier
** Remove or normalize unusual characters
** Expand abbreviations
** Split enumerations
*** e.g. "Vitamines A, B et C" -> "Vitamine A, Vitamine B, Vitamine C"
** Normalize additive E-numbers (E330, e330, e-330, INS 330, SIN330, etc.)
** Split additive classes from the additives that follow them
*** e.g. "Colour caramel" -> "Colour: Caramel"
** Split some, but not all, "A of B, C and D" constructs
*** e.g. "Huile de palme, colza et tournesol" -> "Huile de palme, huile de colza, huile de tournesol"
* Current solution
** Perl code and regular expressions
*** lib/ProductOpener/Ingredients.pm - preparse_ingredients_text()
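
The pre-parsing steps listed above can be illustrated with a minimal Python sketch. This is not the Perl code in preparse_ingredients_text(): the function names, the regular expressions, and the restriction of the "A of B, C and D" split to "huile" are all assumptions made for illustration only.

```python
import re

# Assumed pattern: "E", "INS" or "SIN" prefix, optional separator,
# 3-4 digits, optional letter suffix (e.g. E150a).
E_NUMBER = re.compile(r"\b(?:e|ins|sin)[\s\-.]?(\d{3,4})([a-z])?\b", re.IGNORECASE)

# Assumed pattern: "vitamin(e)(s)" followed by at least two vitamin
# letters separated by commas or "et".
VITAMIN_LIST = re.compile(
    r"\b(?i:vitamine?s?)\s+([A-Z]\d{0,2}(?:(?:\s*,\s*|\s+et\s+)[A-Z]\d{0,2})+)\b"
)

# Assumed whitelist of a single head noun ("huile"). Splitting every
# "A de B, C et D" would be wrong (e.g. "huile de palme, sel et sucre").
OIL_LIST = re.compile(r"\b((?i:huile))\s+de\s+(\w+(?:\s*,\s*\w+)*\s+et\s+\w+)")

SEPARATORS = re.compile(r"\s*,\s*|\s+et\s+")


def normalize_e_numbers(text):
    """Rewrite e330, e-330, INS 330, SIN330, ... as E330."""
    return E_NUMBER.sub(
        lambda m: "E" + m.group(1) + (m.group(2) or "").lower(), text
    )


def split_vitamins(text):
    """Expand 'Vitamines A, B et C' into one entry per vitamin."""
    def expand(m):
        letters = SEPARATORS.split(m.group(1))
        return ", ".join("Vitamine " + letter for letter in letters)
    return VITAMIN_LIST.sub(expand, text)


def split_oils(text):
    """Expand 'Huile de palme, colza et tournesol' into one oil per entry."""
    def expand(m):
        items = SEPARATORS.split(m.group(2))
        parts = ["huile de " + item for item in items]
        parts[0] = m.group(1) + " de " + items[0]  # keep the original casing
        return ", ".join(parts)
    return OIL_LIST.sub(expand, text)


def preparse(text):
    """Apply the three illustrative steps in sequence."""
    return split_oils(split_vitamins(normalize_e_numbers(text)))
```

For example, preparse("Huile de palme, colza et tournesol") returns "Huile de palme, huile de colza, huile de tournesol". Restricting the oil split to a known head noun reflects the "but not all" caveat: a list such as "huile de palme, sel et sucre" has the same shape but must not be split, so a real implementation needs more knowledge than this sketch has.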