
Ingredients analysis and search features extraction



==== Wrong languages ====
Several thousand products had fields (such as the ingredients list) set to the wrong language (e.g. an Italian product whose ingredients list in Italian was recorded in the field for the French ingredients list). This makes the ingredients analysis fail completely.
Most of those products were badly entered by third-party apps. The problem has been corrected (either in the apps or server side), so few new products arrive with values set to the wrong language, but there is a big backlog of products that need to be corrected.
We have tools to detect products that have ingredients in the wrong language (e.g. https://fr.openfoodfacts.org/ingredient/fr:zucchero shows products where "zucchero" (Italian for sugar) is present in the French ingredients list).
We also created a simple tool to change the main language of a product and move the field values from the incorrect language to the correct one.
Finally, we launched special missions for volunteers to review the products and fix them quickly. The volunteers fixed several thousand products between March and May 2020.
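
The detection and correction steps described above can be sketched roughly as follows. This is only an illustration, not the actual tool: the marker vocabulary, the helper names <code>looks_italian</code> and <code>move_language_fields</code>, and the assumption that fields are stored with a language suffix (such as <code>ingredients_text_fr</code>) next to a <code>lang</code> main-language field are all hypothetical choices made for the example.

```python
# Hypothetical mini-vocabulary of Italian ingredient words.
# A real check would rely on the ingredients taxonomy instead.
ITALIAN_MARKERS = {"zucchero", "farina", "latte", "uova", "burro", "sale"}


def looks_italian(ingredients_text, min_hits=2):
    """Return True if at least `min_hits` distinct Italian marker words appear."""
    words = {w.strip(" .,;:()").lower() for w in ingredients_text.split()}
    return len(words & ITALIAN_MARKERS) >= min_hits


def move_language_fields(product, wrong, right, fields=("ingredients_text",)):
    """Return a copy of `product` with values moved from `wrong` to `right`.

    Assumes (for illustration) that fields are named `<field>_<lang>` and that
    the main language is stored under the key `lang`.
    """
    fixed = dict(product)
    for field in fields:
        src = f"{field}_{wrong}"
        dst = f"{field}_{right}"
        # Only move when the source has a value and the target is still empty,
        # so that an existing correct value is never overwritten.
        if fixed.get(src) and not fixed.get(dst):
            fixed[dst] = fixed.pop(src)
    if fixed.get("lang") == wrong:
        fixed["lang"] = right
    return fixed


product = {
    "lang": "fr",
    "ingredients_text_fr": "Farina di grano, zucchero, burro, uova.",
}
if looks_italian(product["ingredients_text_fr"]):
    product = move_language_fields(product, wrong="fr", right="it")
```

The move is skipped when the target field already has a value, which keeps the sketch conservative: ambiguous products are left for manual review instead of being overwritten.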


==== Ingredients list cropping and new OCR extraction ====
Some apps have also sent us data where the ingredients were extracted through OCR but not necessarily cropped correctly. As a result, in addition to the ingredients list, the field can contain other sentences, or lists of ingredients in multiple languages. As part of the special missions mentioned above, volunteers have also cropped photos of ingredients to select only the ingredients in one language, and have re-run the OCR with much better results (OCR has made a lot of progress recently).


==== Spelling correction ====
We have started to work on automatic spell-correction for ingredients lists. In particular, we set up test sets and a test infrastructure so that we can measure the accuracy of spell-correction algorithms and decide whether to apply them with manual supervision or directly (for algorithms with close to 100% precision).
* GitHub repository: https://github.com/openfoodfacts/openfoodfacts-ai/tree/master/spellcheck
* Slack channel: #spelling
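
As a rough illustration of how precision could be measured against a test set, here is a minimal sketch. It does not reproduce the code in the repository above: the function names, the toy corrector, the sample data, and the 0.99 threshold are all assumptions made for the example. Precision is taken here as the share of changed lists whose correction exactly matches the expected text.

```python
def correction_precision(corrector, test_set):
    """Precision = exact-match corrections / all lists the corrector changed.

    `test_set` is a list of (original_text, expected_text) pairs.
    Returns None when the corrector changed nothing.
    """
    changed = 0
    correct = 0
    for original, expected in test_set:
        proposed = corrector(original)
        if proposed != original:
            changed += 1
            if proposed == expected:
                correct += 1
    return correct / changed if changed else None


def decide(precision, threshold=0.99):
    """Hypothetical rule: apply directly only above the threshold."""
    if precision is not None and precision >= threshold:
        return "apply automatically"
    return "apply with manual supervision"


# Toy corrector (illustrative only): one good rule and one harmful rule.
REPLACEMENTS = {"bleé": "blé", "marin": "malin"}


def toy_corrector(text):
    for wrong, right in REPLACEMENTS.items():
        text = text.replace(wrong, right)
    return text


test_set = [
    ("farine de bleé", "farine de blé"),  # real typo, fixed correctly
    ("sucre", "sucre"),                   # already correct, left untouched
    ("sel marin", "sel marin"),           # already correct, wrongly changed
]

precision = correction_precision(toy_corrector, test_set)
decision = decide(precision)
```

Here the toy corrector changes two lists and gets one of them right, so it falls well below the threshold and would be kept under manual supervision.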


=== Ingredients taxonomy improvements ===