
Student projects/GSOC/Proposals: Difference between revisions

Line 116: Line 116:
Background: Over the past year we have started ramping up this effort, and we have processed 1.5 million images with OCR and with general entity, barcode, and QR-code recognition. The result is 1.5 million matching JSON files with bounding boxes.


* '''Slack channels: #ai-machinelearning'''
* '''Slack channels: #ai-machinelearning #spellcheck'''
* '''Github AI / machine learning: openfoodfacts-ai'''  


=== Automatically classify products ===
=== Ingredients spellcheck ===
 
* Ingredient lists extracted by OCR very often contain errors that could be corrected easily with models built specifically for ingredient lists
* We already have a large number of correct ingredient lists in many languages that we could use to build dictionaries, compute frequencies, n-grams, etc.
* The solution needs to be easy to retrain for new languages and new training data so that it can continue to improve
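
As a rough illustration of the dictionary-and-frequency idea above, here is a minimal sketch in Python. It assumes a generic edit-distance-1 corrector ranked by word frequency, which is not necessarily the approach the project will adopt. The class name <code>IngredientSpellchecker</code>, its methods, and the minimum token length of 4 are hypothetical choices made only for this example. Retraining for a new language amounts to calling <code>fit</code> on that language's correct ingredient lists.

```python
import re
from collections import Counter

# Runs of letters only (Unicode-aware), so accented ingredients are kept whole.
WORD = re.compile(r"[^\W\d_]+")


class IngredientSpellchecker:
    """Hypothetical frequency-based corrector trained on known-good lists."""

    def __init__(self):
        self.vocab = Counter()   # token -> frequency in the training lists
        self.alphabet = set()    # characters seen in training, per language

    def fit(self, ingredient_lists):
        # Can be called again with new data to keep improving the dictionary.
        for text in ingredient_lists:
            for token in WORD.findall(text.lower()):
                self.vocab[token] += 1
                self.alphabet.update(token)
        return self

    def _edits1(self, word):
        # All strings one deletion, transposition, substitution or insertion away.
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [left + right[1:] for left, right in splits if right]
        swaps = [left + right[1] + right[0] + right[2:]
                 for left, right in splits if len(right) > 1]
        replaces = [left + ch + right[1:]
                    for left, right in splits if right for ch in self.alphabet]
        inserts = [left + ch + right
                   for left, right in splits for ch in self.alphabet]
        return set(deletes + swaps + replaces + inserts)

    def correct_token(self, token):
        word = token.lower()
        # Leave known words and very short tokens untouched (assumed cutoff).
        if word in self.vocab or len(word) < 4:
            return token
        known = [cand for cand in self._edits1(word) if cand in self.vocab]
        if not known:
            return token
        # Prefer the most frequent candidate; break ties alphabetically.
        return max(known, key=lambda cand: (self.vocab[cand], cand))

    def correct(self, text):
        # Replace word tokens in place so punctuation and spacing are preserved.
        return WORD.sub(lambda m: self.correct_token(m.group(0)), text)


checker = IngredientSpellchecker().fit([
    "sugar, wheat flour, salt",
    "wheat flour, water, sugar",
    "sugar, cocoa butter",
])
print(checker.correct("suqar, wheat fiour, salt"))
```

A real model would likely add n-gram context and OCR-specific confusion costs (for example, treating <code>l</code>/<code>i</code> swaps as cheaper than other substitutions); this sketch only uses unigram frequencies.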
 
=== Data extraction from OCR and other field values ===


* Detect field values from other field values or bag of words from the OCR
Line 125: Line 131:
** Brands (in some cases, a strong feature can be the barcode prefix)
** Labels
* When certain, detected values can be applied immediately
* When precision is very high (99%), we can apply the results directly
* When less certain, we can ask users to confirm suggestions
* For slightly lower precision, we can offer the results to users as suggestions and ask for confirmation
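
The barcode-prefix feature and the two precision tiers described above could be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the 7-digit prefix length, the minimum support of 5 products, and the 0.90 suggestion threshold are all hypothetical, and only the 99% figure comes from the text. The per-prefix majority share is used here as a stand-in for precision; in practice the thresholds would have to be calibrated on a labelled validation set.

```python
from collections import Counter, defaultdict

AUTO_APPLY_THRESHOLD = 0.99  # "very high" precision, from the proposal text
SUGGEST_THRESHOLD = 0.90     # assumed value for "slightly lower" precision


def build_prefix_index(products, prefix_len=7):
    """Count brands per barcode prefix from (barcode, brand) pairs."""
    index = defaultdict(Counter)
    for barcode, brand in products:
        index[barcode[:prefix_len]][brand] += 1
    return index


def predict_brand(barcode, index, prefix_len=7, min_support=5):
    """Return (brand, confidence), or (None, 0.0) when evidence is too thin."""
    counts = index.get(barcode[:prefix_len])
    if not counts or sum(counts.values()) < min_support:
        return None, 0.0
    brand, hits = counts.most_common(1)[0]
    # Share of products with this prefix that carry the majority brand.
    return brand, hits / sum(counts.values())


def route(value, confidence):
    """Decide what to do with a detected field value."""
    if value is None or confidence < SUGGEST_THRESHOLD:
        return "discard"
    if confidence >= AUTO_APPLY_THRESHOLD:
        return "apply"    # written to the product directly
    return "suggest"      # shown to a user for confirmation


# Hypothetical training data: barcodes and brand names are made up.
known = (
    [("1111111%06d" % i, "BrandA") for i in range(100)]
    + [("2222222%06d" % i, "BrandB") for i in range(19)]
    + [("2222222999999", "BrandC")]
)
index = build_prefix_index(known)

for code in ("1111111123456", "2222222123456", "9999999123456"):
    brand, confidence = predict_brand(code, index)
    print(code, brand, round(confidence, 2), route(brand, confidence))
```

Keeping the routing rule separate from the predictor means the same thresholds can be reused for other fields, such as labels, once each predictor reports a calibrated score.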


=== Automatically detect errors ===