Line 116: | Line 116: | ||
Background: Over the past year we have ramped up our efforts and processed 1.5 million images with OCR and general entity, barcode and QR-code recognition. The result is 1.5 million matching JSON files with bounding boxes. | Background: Over the past year we have ramped up our efforts and processed 1.5 million images with OCR and general entity, barcode and QR-code recognition. The result is 1.5 million matching JSON files with bounding boxes. | ||
* '''Slack channels: #ai-machinelearning''' | * '''Slack channels: #ai-machinelearning #spellcheck''' | ||
* '''GitHub AI / machine learning: openfoodfacts-ai''' | * '''GitHub AI / machine learning: openfoodfacts-ai''' | ||
=== | === Ingredients spellcheck === | ||
* Ingredient lists extracted by OCR very often contain errors that could easily be corrected if we built dedicated models for ingredient lists | |||
* We already have a large number of correct ingredient lists in many languages that we could use to build dictionaries and compute word frequencies, n-grams, etc. | |||
* The solution needs to be easily retrained for new languages and for new training data so that it can continue to improve | |||
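The dictionary and frequency idea above can be illustrated with a minimal sketch. It is not the project's implementation: the function names (<code>build_vocabulary</code>, <code>correct_word</code>, <code>correct_text</code>) and the sample ingredient lists are hypothetical, and the approach (pick the most frequent known word within one edit of an unknown token) is only one simple baseline. Because the vocabulary and the alphabet are both derived from the training lists, retraining for a new language only requires new lists.

```python
import re
from collections import Counter

# Runs of Unicode letters; digits, punctuation and underscores are left alone.
WORD_RE = re.compile(r"[^\W\d_]+")


def build_vocabulary(ingredient_lists):
    """Count word frequencies in known-correct ingredient lists."""
    counts = Counter()
    for text in ingredient_lists:
        counts.update(WORD_RE.findall(text.lower()))
    return counts


def edits1(word, alphabet):
    """All strings one deletion, transposition, replacement or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in alphabet]
    inserts = [left + c + right for left, right in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


def correct_word(word, counts):
    """Return the most frequent known word within one edit, else the word itself."""
    if word in counts:
        return word
    alphabet = sorted(set("".join(counts)))  # letters seen in the training data
    candidates = [w for w in edits1(word, alphabet) if w in counts]
    if not candidates:
        return word
    # Break frequency ties alphabetically so the result is deterministic.
    return max(candidates, key=lambda w: (counts[w], w))


def correct_text(text, counts):
    """Correct every word in an OCR string, keeping punctuation as-is (output is lowercased)."""
    return WORD_RE.sub(lambda m: correct_word(m.group(0).lower(), counts), text)


# Hypothetical training data standing in for real, verified ingredient lists.
counts = build_vocabulary([
    "sugar, wheat flour, salt",
    "wheat flour, water, salt, sugar",
    "sugar, cocoa butter",
])
print(correct_text("sugax, wheat fiour, salt", counts))  # sugar, wheat flour, salt
```

A real model would likely also need n-gram context and OCR-specific confusions (for example "l" versus "1"), which this sketch ignores.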
=== Data extraction from OCR and other field values === | |||
* Detect field values from other field values or bag of words from the OCR | * Detect field values from other field values or bag of words from the OCR | ||
Line 125: | Line 131: | ||
** Brands (in some cases, a strong feature can be the barcode prefix) | ** Brands (in some cases, a strong feature can be the barcode prefix) | ||
** Labels | ** Labels | ||
* When | * When precision is very high (99%), we can apply the results directly | ||
* | * For slightly lower precision, we can offer suggestions to users and ask them to confirm | ||
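The two precision tiers above can be sketched as a simple routing rule. This is an illustration under assumptions, not an existing Open Food Facts component: it assumes each prediction carries a confidence score, that a validation set of (confidence, was it correct) pairs is available, and the names <code>lowest_threshold_for_precision</code> and <code>route</code> as well as the sample numbers are hypothetical.

```python
def lowest_threshold_for_precision(validation, target):
    """Lowest confidence t such that predictions with confidence >= t
    reach at least `target` precision on the validation set.

    `validation` is a list of (confidence, is_correct) pairs.
    Returns None if no threshold reaches the target.
    """
    ranked = sorted(validation, key=lambda pair: pair[0], reverse=True)
    best = None
    correct = 0
    for i, (confidence, is_correct) in enumerate(ranked, start=1):
        correct += bool(is_correct)
        # Only evaluate once all predictions sharing this confidence are counted.
        if i < len(ranked) and ranked[i][0] == confidence:
            continue
        if correct / i >= target:
            best = confidence
    return best


def route(confidence, apply_threshold, suggest_threshold):
    """Decide what to do with one prediction."""
    if confidence >= apply_threshold:
        return "apply"      # very high precision: write the value directly
    if confidence >= suggest_threshold:
        return "suggest"    # lower precision: ask a user to confirm
    return "ignore"


# Hypothetical validation data: (model confidence, prediction was correct).
validation = [(0.99, True), (0.95, True), (0.90, True),
              (0.80, False), (0.70, True), (0.60, False)]
apply_t = lowest_threshold_for_precision(validation, 1.0)    # 0.9
suggest_t = lowest_threshold_for_precision(validation, 0.8)  # 0.7
print(route(0.95, apply_t, suggest_t))  # apply
print(route(0.75, apply_t, suggest_t))  # suggest
```

In practice the 99% target from the list above would be estimated on a much larger validation set, and the thresholds would probably be chosen per field (category, brand, label).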
=== Automatically detect errors === | === Automatically detect errors === |