Ingredient datasource synchronization

From Open Food Facts wiki
Revision as of 22:06, 31 October 2023 by Jayaddison (talk | contribs) (Add initial page structure)

Disclaimer: this article originates from the author's perspective developing RecipeRadar, a recipe search engine project that is independent of OpenFoodFacts. As such this page should be critically reviewed to ensure that it meets acceptability and accuracy levels for continued inclusion in the OpenFoodFacts Wiki.

Background: Ingredients in OpenFoodFacts

The products listed in OpenFoodFacts include information about the ingredients that they contain, and where possible those constituent ingredients are matched to entries in the OpenFoodFacts ingredients taxonomy -- one of several taxonomies that the project provides.

To accommodate internationalization and localization, ingredients within OpenFoodFacts are identified by one or more language:name pairs. So, for example, en:apple, fr:pomme and it:mela are three equally valid identifiers that all refer to the same popular edible fruit.
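As a minimal sketch of the language:name convention described above, the following Python splits an identifier at its first colon. The names IngredientId and parse_identifier are hypothetical and chosen for illustration; they are not part of any OpenFoodFacts API.

```python
from typing import NamedTuple


class IngredientId(NamedTuple):
    """Hypothetical container for one language:name pair."""
    language: str  # language code, e.g. "en"
    name: str      # ingredient name in that language, e.g. "apple"


def parse_identifier(identifier: str) -> IngredientId:
    """Split 'language:name' at the first colon; reject malformed input."""
    language, sep, name = identifier.partition(":")
    if not sep or not language.strip() or not name.strip():
        raise ValueError(f"expected 'language:name', got {identifier!r}")
    return IngredientId(language.strip().lower(), name.strip())


# The three identifiers from the example above, all naming the same ingredient.
apple_ids = [parse_identifier(s) for s in ("en:apple", "fr:pomme", "it:mela")]
```

Splitting at the first colon only means that a name which itself contains a colon would stay intact; whether such names occur in practice is not something this sketch assumes either way.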

The ingredients taxonomy has a hierarchical structure: any named ingredient may optionally indicate that it is a more-specific variant of a less-specific parent ingredient. So to continue the example, de:Apfelmark is a refinement of fr:pomme. Whether the refinement is a natural, linguistic, cultural or food-preparation derivation of the parent ingredient is undefined, allowing for flexibility in usage.
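The parent relationship can be illustrated with a small lookup table. This is a sketch under stated assumptions: it models a single parent per ingredient, the PARENTS table and the ancestors function are hypothetical, and only the de:Apfelmark to fr:pomme link comes from the text above.

```python
from typing import Dict, List

# Hypothetical child -> parent links. Only the first entry is taken from the
# example in the text; the second is an assumed parent for illustration.
PARENTS: Dict[str, str] = {
    "de:Apfelmark": "fr:pomme",
    "fr:pomme": "en:fruit",
}


def ancestors(identifier: str, parents: Dict[str, str]) -> List[str]:
    """Return the chain of parents, from most specific to least specific."""
    chain: List[str] = []
    seen = {identifier}
    current = identifier
    while current in parents:
        current = parents[current]
        if current in seen:
            # Guard against accidental loops in hand-edited data.
            raise ValueError(f"cycle detected at {current!r}")
        seen.add(current)
        chain.append(current)
    return chain


apfelmark_chain = ancestors("de:Apfelmark", PARENTS)
```

Because the kind of refinement is left undefined, the sketch records only that a link exists, not why.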

The value of synchronizing datasources

Synchronizing and cross-referencing the ingredients included in OpenFoodFacts with other datasources can benefit the quality of every datasource involved.

To make progress, the goal should not be to establish a single authoritative datasource - an objective that's unlikely to be either achievable or desirable across multiple projects and cultures with subjective and reasonable opinions of their own - but to enrich each datasource and to learn and provide additional (and ideally well-substantiated) supporting information that may help each participant in future.

Sometimes cross-referencing an individual ingredient is straightforward: one OpenFoodFacts ingredient may map near-unambiguously to exactly one corresponding item in another datasource. In practice, however, ambiguities can and do occur for several reasons:

 * Naming ambiguities and the effects of natural-language translation (lime, lemon, citron).
 * Differing levels of dataset coverage, often corresponding to cultural familiarity and ingredient availability (black beans).
 * Differences in dataset granularity (tomato, cherry tomato, plum tomato).
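One way to make these cases visible is to count how many candidate entries each ingredient has in the other datasource. The sketch below is illustrative only: classify_matches and the candidates table are hypothetical, and the sample data simply reuses the examples from the list above.

```python
from typing import Dict, List


def classify_matches(candidates: Dict[str, List[str]]) -> Dict[str, str]:
    """Label each ingredient by how many candidate matches it has."""
    report: Dict[str, str] = {}
    for name, matches in candidates.items():
        if len(matches) == 0:
            report[name] = "missing"      # coverage gap in the other dataset
        elif len(matches) == 1:
            report[name] = "unambiguous"  # straightforward one-to-one mapping
        else:
            report[name] = "ambiguous"    # naming or granularity mismatch
    return report


# Hypothetical candidate lists, loosely based on the examples above.
candidates = {
    "apple": ["apple"],
    "black beans": [],
    "tomato": ["tomato", "cherry tomato", "plum tomato"],
}
report = classify_matches(candidates)
```

Entries labelled "ambiguous" or "missing" are the ones likely to need human review before any suggestion is made to the other datasource.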

Whether to accept an edit (addition, update or removal) of an entry in a datasource is the responsibility of the recipient; suggestions can be made, but those suggestions may be modified before they're accepted, with or without the opportunity for dialogue, or they may be rejected.

Datasource: WikiData

WikiData is a structured data initiative hosted by the Wikimedia Foundation; since it is universal in scope, food ingredients are part of the domain that it covers. Within WikiData, each data item is identified by a QID -- a unique integer prefixed by the letter 'Q'. For example, tomato has the WikiData QID of Q23501.
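A minimal sketch of checking that a string has the QID shape described above (the letter 'Q' followed by a positive integer). The parse_qid helper is hypothetical; it validates the format only and does not confirm that the item exists in WikiData.

```python
import re

# 'Q' followed by a positive integer with no leading zero.
QID_PATTERN = re.compile(r"Q([1-9][0-9]*)")


def parse_qid(value: str) -> int:
    """Return the integer part of a QID, or raise ValueError if malformed."""
    match = QID_PATTERN.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"not a valid QID: {value!r}")
    return int(match.group(1))


tomato_number = parse_qid("Q23501")
```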

The data model -- the schema -- is flexible and extensible, and is curated and maintained by the community. Categories of items can be created, and categorical properties with optional-or-expected values can be added and removed over time. Data quality varies; naively speaking, more popular items are likely to achieve greater eventual accuracy, whereas less-popular items tend to receive less attention.

Datasource: RecipeRadar

RecipeRadar is a community-benefit company that provides a free-and-open-source recipe search engine and meal-planning tools to the public at no cost. Much of that service relies upon a manually-curated dataset of ingredients; those ingredients become the query terms that users can search for, and also the document terms that are indexed in the RecipeRadar corpus of recipe webpages retrieved from the world wide web.

In contrast to WikiData, the data model of RecipeRadar is fairly static, hierarchical and managed by a single company. This data model does include a field (database column) that allows a listed ingredient to cross-reference a WikiData QID -- and to that extent it is possible to cross-walk between RecipeRadar and OpenFoodFacts in either direction, with some potential for missing entries either at the WikiData step or beyond.
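The cross-walk can be sketched as a join on the shared QID. This is an illustration under assumptions rather than a description of either project's implementation: it assumes each side can be exported as a mapping from ingredient to an optional QID, the crosswalk function and both sample tables are hypothetical, and the only QID used is the tomato example given earlier on this page.

```python
from typing import Dict, List, Optional, Tuple


def crosswalk(
    reciperadar: Dict[str, Optional[str]],
    openfoodfacts: Dict[str, Optional[str]],
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Join two ingredient -> QID mappings on their QID values."""
    # Index the OpenFoodFacts side by QID, skipping entries without one.
    by_qid: Dict[str, List[str]] = {}
    for off_id, qid in openfoodfacts.items():
        if qid is not None:
            by_qid.setdefault(qid, []).append(off_id)

    matched: Dict[str, List[str]] = {}
    unmatched: List[str] = []
    for name, qid in reciperadar.items():
        if qid is not None and qid in by_qid:
            matched[name] = by_qid[qid]
        else:
            # Missing at the WikiData step, or no counterpart beyond it.
            unmatched.append(name)
    return matched, unmatched


# Hypothetical sample data; None marks an entry with no recorded QID.
reciperadar = {"tomato": "Q23501", "black beans": None}
openfoodfacts = {"en:tomato": "Q23501", "en:lime": None}
matched, unmatched = crosswalk(reciperadar, openfoodfacts)
```

Swapping the two arguments gives the cross-walk in the opposite direction; the unmatched list is where the missing entries mentioned above would surface.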

Process: how to reconcile two datasources

TODO