Ingredients analysis and search features extraction
Revision as of 14:10, 14 May 2020
Summary
Ingredients analysis and search features extraction is one of the 4 sub-tasks of the Project:Personalized_Search funded by the NGI0 Discovery Fund managed by NLnet.
This page documents the progress made in Q1 and Q2 2020.
Methods and infrastructure
Ingredients analysis has been gradually added to Open Food Facts in a very organic way (the first versions of additive and palm oil detection in French ingredients lists date from 2012). A lot of progress has been made over the years, but ingredients analysis has remained a complex, undocumented, and artisanal effort mostly focused on French, with only one developer coding it and very few people able to improve it.
The first focus of the project has thus been to develop methods and infrastructure to industrialize ingredients analysis and bring it to the next level for many more languages.
Documentation
Ingredients analysis is a complex process with several tasks that are done in sequence, with the output of each task becoming the input of the next task. We have greatly improved the documentation to make it easier for more people to contribute improvements to the code, data and tests of each task.
- Ingredients Extraction and Analysis : a high-level diagram + detailed information on how we perform ingredients extraction and analysis.
- How to improve ingredients analysis : lists the concrete actions that can be taken to improve ingredients analysis in a specific language.
Metrics definition, reporting and monitoring
In order to prioritize the work to improve ingredients analysis and monitor our progress, we have defined quality metrics and created tools to report them.
- Ingredients Analysis Quality : definition of the ingredients analysis quality metrics and instructions to retrieve those metrics for a specific sub-set of products (e.g. all products sold in a specific country).
We have also set up a monitoring system so that we can see the evolution of ingredients analysis metrics over time.
Visibility of ingredients analysis results and internals
To make it easier for more people to find, report and debug issues with ingredients analysis, we are now showing the result of ingredient analysis (whether a product is vegetarian, vegan or palm oil free) directly on the product page of the Open Food Facts web site, with a link to show exactly how we have parsed and analyzed the ingredients list.
Testing tool
In addition to seeing the details of the ingredients analysis for a specific product, users can also see the details of the analysis of an ingredient list they can type in, copy/paste, and modify in a simple web form.
This tool greatly facilitates debugging and creating minimal tests to reproduce issues.
Assessment of the Ingredients Analysis Quality for major EU languages
In early March 2020, we recorded the quality metrics of ingredients analysis for major European languages, to establish a baseline that we can compare against at the end of the project.
Improvements to ingredients analysis
The quality of ingredients analysis directly depends on the quality of the input data (clean ingredients lists), the parsing features (whether we can recognize a given wording structure, from simple enumerations like "X and Y" to much more complex formulations), and the supporting data used by those features (the most important one being our multilingual ingredients taxonomy).
For this project, we worked to improve all 3 aspects.
Ingredients list cleaning
Wrong languages
Several thousand products had their fields (like the ingredients list) set to the wrong language (e.g. an Italian product with a list of ingredients in Italian recorded in the field for the ingredients list in French). This makes the ingredients analysis fail completely.
Most of those products were entered incorrectly by 3rd party apps. The problem has been corrected (either in the apps or server side), so few new products have values set to the wrong language, but there is a big backlog of products that need to be corrected.
We have tools to detect products that have ingredients in a wrong language (e.g. https://fr.openfoodfacts.org/ingredient/fr:zucchero shows products where "zucchero" (Italian for sugar) is present in the French ingredients list).
We also created a simple tool to change the main language of a product and move field values from the incorrect language to the correct one.
We also launched special missions for volunteers to review the products and fix them quickly. The volunteers fixed several thousand products in the March / April / May 2020 timeframe.
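The language fix described above can be pictured with a small sketch. It is not the actual tool: the product is modelled as a plain dictionary, and the field names `lang` and `ingredients_text_<language code>` as well as the helper `move_to_language` are assumptions made for illustration.

```python
def move_to_language(product, wrong_lang, correct_lang, fields=("ingredients_text",)):
    """Return a copy of `product` with language-specific values moved
    from `wrong_lang` to `correct_lang` (hypothetical field layout)."""
    fixed = dict(product)  # work on a copy, leave the input untouched
    for field in fields:
        wrong_key = f"{field}_{wrong_lang}"
        correct_key = f"{field}_{correct_lang}"
        # Only move the value if the target field is missing or empty,
        # so that existing correct data is never overwritten.
        if wrong_key in fixed and not fixed.get(correct_key):
            fixed[correct_key] = fixed.pop(wrong_key)
    # Update the main language if it was the wrong one.
    if fixed.get("lang") == wrong_lang:
        fixed["lang"] = correct_lang
    return fixed


# An Italian ingredients list stored in the French field.
product = {"lang": "fr", "ingredients_text_fr": "zucchero, farina di frumento"}
fixed = move_to_language(product, "fr", "it")
```

The sketch refuses to overwrite a non-empty target field, which seems the safer default when correcting a backlog in bulk.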
Ingredients list cropping and new OCR extraction
Some apps have also sent us data where the ingredients are extracted through OCR, but not necessarily cut correctly. So in addition to the ingredients list, we can have other sentences, or lists of ingredients in multiple languages. As part of the special missions mentioned above, volunteers have also cropped photos of ingredients to select only the ingredients in one language, and have been re-running the OCR with much better results (OCR has made a lot of progress recently).
Spelling correction
We have started to work on automatic spell-correction for ingredients lists. In particular, we set up test sets and a test infrastructure so that we can measure the accuracy of spell correction algorithms and decide whether we should apply them with manual supervision, or whether we can apply them directly (for algorithms with close to 100% precision).
- GitHub repository: https://github.com/openfoodfacts/openfoodfacts-ai/tree/master/spellcheck
- Slack channel: #spelling
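As a rough illustration of the kind of measurement such a test set allows, the sketch below computes the precision and recall of a corrector over (original, expected) pairs. It is not the code from the spellcheck repository: `evaluate_corrector`, `toy_corrector`, the sample data and the 0.99 threshold are all assumptions made for the example.

```python
def evaluate_corrector(corrector, test_set):
    """Return (precision, recall) of `corrector` on (original, expected) pairs."""
    changed = 0          # items the corrector modified
    correct_changes = 0  # modified items that match the expected text
    needed = 0           # items that actually required a correction
    for original, expected in test_set:
        output = corrector(original)
        if expected != original:
            needed += 1
        if output != original:
            changed += 1
            if output == expected:
                correct_changes += 1
    precision = correct_changes / changed if changed else 1.0
    recall = correct_changes / needed if needed else 1.0
    return precision, recall


def toy_corrector(text):
    """Hypothetical corrector that only knows two fixes."""
    fixes = {"sucre de cane": "sucre de canne", "farine de blè": "farine de blé"}
    return fixes.get(text, text)


TEST_SET = [
    ("sucre de cane", "sucre de canne"),
    ("farine de blè", "farine de blé"),
    ("huile de plame", "huile de palme"),  # not handled by the toy corrector
    ("sel", "sel"),                        # already correct
]

precision, recall = evaluate_corrector(toy_corrector, TEST_SET)
# Hypothetical decision rule: apply without supervision only at very high precision.
apply_automatically = precision >= 0.99
```

Precision is the figure that matters for unsupervised application, since a wrong automatic edit damages data that was previously only misspelled.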
Ingredients taxonomy improvements
The main goal of ingredients analysis is to detect each individual ingredient of the ingredients list and map it to our multilingual ingredients taxonomy, so building a comprehensive list of ingredients with all their possible synonyms in as many languages as possible is very important.
We are constantly improving the ingredients taxonomy, in different ways:
Translations of existing entries
Volunteers can translate existing entries into their own language through an interface on the Open Food Facts web site (e.g. https://pl.openfoodfacts.org/ingredients?translate=1 to translate ingredients to Polish).
Incorporation of the most frequent unknown ingredients
Contributors familiar with the taxonomy and with experience using Git can directly edit the ingredients taxonomy definition file to incorporate the most frequent unrecognized ingredients in a language (e.g. https://nl.openfoodfacts.org/ingredients?status=unknown&limit=1000 ).
Unrecognized entries need to be added as a translation or as a synonym of an existing entry, or added as a new entry if we don't already have a corresponding entry in English or another language.
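To make the role of translations and synonyms concrete, here is a minimal sketch of mapping ingredient names to canonical taxonomy entries and collecting the unrecognized ones. The in-memory `TAXONOMY` dictionary, its entry identifiers and the function names are illustrative assumptions, not the format of the actual taxonomy definition file.

```python
# Hypothetical taxonomy: canonical entry -> names and synonyms per language.
TAXONOMY = {
    "en:sugar": {"en": ["sugar"], "fr": ["sucre"], "it": ["zucchero"]},
    "en:wheat-flour": {
        "en": ["wheat flour"],
        "fr": ["farine de blé", "farine de froment"],
    },
}


def build_index(taxonomy, lang):
    """Map every known name in `lang` (lowercased) to its canonical entry."""
    index = {}
    for canonical_id, names in taxonomy.items():
        for name in names.get(lang, []):
            index[name.lower()] = canonical_id
    return index


def map_ingredients(ingredients, taxonomy, lang):
    """Split ingredients into recognized (name -> entry) and unrecognized."""
    index = build_index(taxonomy, lang)
    known, unknown = {}, []
    for ingredient in ingredients:
        key = ingredient.strip().lower()
        if key in index:
            known[ingredient] = index[key]
        else:
            unknown.append(ingredient)
    return known, unknown


known, unknown = map_ingredients(
    ["Sucre", "farine de froment", "zucchero"], TAXONOMY, "fr"
)
```

In this example "zucchero" ends up in `unknown` because it is only listed as an Italian name, so unrecognized entries can come from a wrong-language list as well as from a missing synonym.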
Ingredients processing taxonomy
Many ingredients lists also contain information on how the ingredients have been processed (e.g. "cooked pork meat", "sliced tomatoes", "powdered garlic").
Instead of listing all possible combinations of processing for each ingredient in the ingredients taxonomy, we have created a taxonomy of processing methods that we use during ingredients parsing.
At this point, we are adding new processing methods and translations for them very carefully, as there is a risk of having false positives. Before adding new processing methods, we first look at the list of ingredients that contain them (using URLs like https://us.openfoodfacts.org/ingredients?filter=cooked )
The ingredients processing taxonomy is a huge improvement as it is going to drastically reduce the number of entries needed in the ingredients taxonomy.
Ingredients parsing features
The code for ingredients parsing is in Ingredients.pm.
In Q1 and Q2 2020, we made a lot of improvements, big and small.
In particular, we made 2 improvements that very significantly improve ingredients analysis and extend its impact:
Support for processing methods
Thanks to the ingredients processing taxonomy, we are now able to parse things like "cooked sliced tomatoes, smoked garlic powder" and map them to the individual ingredients "tomatoes" and "garlic".
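The idea can be sketched in a few lines. This is a deliberately naive, English-only illustration and not the logic of Ingredients.pm: the word sets `PROCESSING_METHODS` and `INGREDIENTS` and the functions `parse_ingredient` and `parse_list` are assumptions made for the example.

```python
# Hypothetical, tiny stand-ins for the two taxonomies.
PROCESSING_METHODS = {"cooked", "sliced", "smoked", "powdered", "powder"}
INGREDIENTS = {"tomatoes", "garlic", "pork meat"}


def parse_ingredient(text):
    """Separate processing words from the ingredient name.

    Returns (ingredient, processing_methods); ingredient is None when what
    remains after removing processing words is not a known ingredient.
    """
    words = text.lower().split()
    processing = [w for w in words if w in PROCESSING_METHODS]
    remaining = " ".join(w for w in words if w not in PROCESSING_METHODS)
    ingredient = remaining if remaining in INGREDIENTS else None
    return ingredient, processing


def parse_list(text):
    """Parse a comma-separated ingredients list."""
    return [parse_ingredient(part.strip()) for part in text.split(",")]


parsed = parse_list("cooked sliced tomatoes, smoked garlic powder")
# parsed == [("tomatoes", ["cooked", "sliced"]), ("garlic", ["smoked", "powder"])]
```

Matching on single words like this is exactly where false positives can creep in, since a processing word may be part of an ingredient's real name.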
Estimations of minimum and maximum percent ranges for ingredients
Most ingredients in lists of ingredients do not have an actual percentage specified, but they are listed in decreasing order of quantity. We have implemented an algorithm to compute the possible percent range of each ingredient, which makes it possible to estimate for instance the percentage of fruits and vegetables in a product. That percentage can then be used to search products, or as a criterion for ranking products. It is also used to compute a more precise Nutri-Score value.
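The reasoning behind such ranges can be shown on the simplest case. The sketch below assumes a flat list with no specified percentages and no sub-ingredients, sorted in decreasing order and summing to 100%; it is an illustration of the principle, not necessarily the algorithm implemented in Open Food Facts, and `percent_ranges` is a hypothetical name.

```python
def percent_ranges(ingredients):
    """Return (name, min_percent, max_percent) for each ingredient.

    Assumes the list is in decreasing order of quantity, sums to 100%,
    and contains no specified percentages or sub-ingredients.
    """
    n = len(ingredients)
    ranges = []
    for rank, name in enumerate(ingredients, start=1):
        # The ingredient at position `rank` is largest when the first `rank`
        # ingredients are all equal and the remaining ones are negligible.
        maximum = 100.0 / rank
        # The first ingredient is at least an equal share of the total;
        # any later ingredient can be arbitrarily close to 0.
        minimum = 100.0 / n if rank == 1 else 0.0
        ranges.append((name, minimum, maximum))
    return ranges


ranges = percent_ranges(["apples", "sugar", "lemon juice", "cinnamon"])
# apples: 25.0 to 100.0, sugar: 0.0 to 50.0
```

Summing the minimums of the fruit and vegetable ingredients gives a conservative lower bound on their total share; any percentage that is actually specified on the label would narrow the ranges further.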
Small parsing improvements
Some examples of smaller improvements:
- Handling enumerations of vitamins in Finnish
- A better match for the signs like * that indicate organic ingredients
- Better handling of abbreviations like L. Acidophilus
- Better handling for symbols like °
- Ingredients parsing improvements and test web interface
Results of improvements
In mid May 2020, we recorded the quality metrics for ingredients analysis so that we could compare them to the early March 2020 metrics, and measure the impact of the improvements we made.