Ingredient datasource synchronization
Revisions by Jayaddison
Latest revision as of 22:41, 2 November 2023
Disclaimer: this article originates from the author's perspective developing RecipeRadar, a recipe search engine project that is independent of OpenFoodFacts. As such this page should be critically reviewed to ensure that it meets acceptability and accuracy levels for continued inclusion in the OpenFoodFacts Wiki.
Background: Ingredients in OpenFoodFacts
The products listed in OpenFoodFacts include information about the ingredients that they contain, and where possible those constituent ingredients are matched to entries in the OpenFoodFacts ingredients taxonomy -- one of the multiple taxonomies that the project provides.
To accommodate internationalization and localization, ingredients within OpenFoodFacts are identified by one-or-more language:name pairs. So, for example, en:apple, fr:pomme and it:mela are three equally-valid identifiers that all refer to the same popular edible fruit.
The ingredients taxonomy includes a hierarchical structure: any named ingredient may optionally indicate that it is a more-specific variant of a less-specific parent ingredient. So to continue the example, de:Apfelmark is a refinement of fr:pomme. Whether the refinement is a natural, linguistic, cultural or food-preparation derivation of the parent ingredient is undefined, allowing for flexibility in usage.
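As an illustration only, the identifier and parent relationships described above could be modelled roughly as follows. This is a minimal sketch, not the OpenFoodFacts implementation: the `Ingredient` class, its field names and the single-parent assumption are hypothetical and chosen for readability.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Ingredient:
    """Hypothetical taxonomy entry known by one or more language:name pairs."""
    names: dict                                  # language code -> name
    parent: Optional["Ingredient"] = None        # less-specific ingredient, if any

    def identifiers(self) -> set:
        """Every language:name identifier that refers to this ingredient."""
        return {f"{lang}:{name}" for lang, name in self.names.items()}

    def ancestors(self) -> list:
        """Walk the parent links from most specific to least specific."""
        found = []
        node = self.parent
        while node is not None:
            found.append(node)
            node = node.parent
        return found


# The examples used in the text above.
apple = Ingredient(names={"en": "apple", "fr": "pomme", "it": "mela"})
apfelmark = Ingredient(names={"de": "Apfelmark"}, parent=apple)
```

Here `apfelmark.ancestors()` returns a list containing only `apple`, and `apple.identifiers()` contains all three of its equally-valid identifiers.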
The value of synchronizing datasources
Synchronizing and cross-referencing the ingredients included in OpenFoodFacts with other datasources can be of mutual benefit to the quality of all datasources involved.
To make progress, the goal should not be a single authoritative datasource - an objective that's unlikely to be either achievable or desirable across multiple projects and cultures with subjective and reasonable opinions of their own - but to enrich each datasource and to learn and provide additional (and ideally well-substantiated) supporting information that may help each participant in future.
Sometimes cross-referencing an individual ingredient is straightforward: one OpenFoodFacts ingredient may map near-unambiguously to exactly one corresponding item in another datasource. In practice, however, ambiguities can and do occur for various reasons:
* Naming ambiguities and the effects of natural-language translation (lime, lemon, citron).
* Differing levels of dataset coverage, often corresponding to cultural familiarity and ingredient availability (black beans).
* Differences in dataset granularity (tomato, cherry tomato, plum tomato).
Whether to accept an edit (addition, update or removal) of an entry in a datasource is the responsibility of the recipient; suggestions can be made, but those suggestions may be modified before they're accepted, with or without the opportunity for dialogue, or they may be rejected.
Datasource: WikiData
WikiData is a structured data initiative hosted by the WikiMedia foundation, and since it is universal in scope, food ingredients are part of the domain that it covers. Within WikiData, each data item is identified by a QID -- a unique integer prefixed by the letter 'Q'. For example, tomato has a WikiData QID of Q23501.
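A small sketch of how a QID might be validated before it is stored as a cross-reference. The helper names `parse_qid` and `qid_url` are hypothetical, and the pattern assumes that the numeric part is a positive integer without leading zeros; the URL form is the one used for the tomato item above.

```python
import re

# Assumption: 'Q' followed by a positive integer with no leading zeros.
QID_PATTERN = re.compile(r"Q[1-9][0-9]*")


def parse_qid(value: str) -> int:
    """Return the numeric part of a QID, or raise ValueError if malformed."""
    candidate = value.strip()
    if QID_PATTERN.fullmatch(candidate) is None:
        raise ValueError(f"not a QID: {value!r}")
    return int(candidate[1:])


def qid_url(value: str) -> str:
    """Build the item page address for a validated QID."""
    return f"https://www.wikidata.org/wiki/Q{parse_qid(value)}"
```

For the tomato example, `parse_qid("Q23501")` gives the integer 23501, while a value such as `"23501"` (missing the prefix) raises `ValueError`.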
The data model -- the schema -- is flexible and extensible, and is curated and maintained by the community. Categories of items can be created, and categorical properties with optional-or-expected values can be added and removed over time. Data quality varies; naively speaking, more popular items are likely to achieve greater eventual accuracy, whereas less-popular items tend to receive less attention.
Datasource: RecipeRadar
RecipeRadar is a community-benefit company that provides free-and-open-source recipe search engine and meal planning tools to the public at no cost. Much of the provision of that service relies upon a manually-curated dataset of ingredients; those ingredients become the query terms that users can search for, and also become the document terms that are indexed in the RecipeRadar corpus of recipe webpages retrieved from the world wide web.
In contrast to WikiData, the data model of RecipeRadar is fairly static, hierarchical and managed by a single company. This data model does include a field (database column) that allows a listed ingredient to cross-reference a WikiData QID -- and to that extent it is possible to cross-walk between RecipeRadar and OpenFoodFacts by way of WikiData, with some potential for missing entries either at the WikiData step or beyond.
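The cross-walk idea can be sketched as a join on the shared QID. This is an assumption-laden illustration: the `crosswalk` function, the dictionary shapes and the sample identifiers are hypothetical, and it presumes that each datasource can be exported as a mapping from its own item identifier to an optional QID.

```python
from typing import Optional


def crosswalk(origin_qids: dict, destination_qids: dict):
    """Pair items from two datasources that reference the same QID.

    Both arguments map an item identifier to a QID string, or to None
    when the item has no WikiData cross-reference yet.
    Returns (pairs, unmatched_origin_ids).
    """
    by_qid: dict = {}
    for dest_id, qid in destination_qids.items():
        if qid is not None:
            by_qid.setdefault(qid, []).append(dest_id)

    pairs, unmatched = [], []
    for origin_id, qid in origin_qids.items():
        matches = by_qid.get(qid, []) if qid is not None else []
        if matches:
            pairs.extend((origin_id, dest_id) for dest_id in matches)
        else:
            unmatched.append(origin_id)   # missing entry at the QID step
    return pairs, unmatched


# Hypothetical exports; only the tomato QID comes from the text above.
reciperadar = {"tomato": "Q23501", "black beans": None}
openfoodfacts = {"en:tomato": "Q23501"}
pairs, unmatched = crosswalk(reciperadar, openfoodfacts)
```

Items that end up in `unmatched` correspond to the missing entries mentioned above, and are candidates for the manual process described in the next section.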
Process: how to reconcile two datasources
Our objective here is to take two datasources, one that we will call the origin datasource and one that we will call the destination datasource, and to improve at least one of them.
It may be possible to automate much of this process. To begin with, however, this page describes a manual, iterative process that is time-consuming but allows for subjective decision-making and human resolution of ambiguities.
Preparation
Before starting, it's worth comparing some essential statistics about the datasources:
* How many records (items) are included in each datasource?
* What can be used as a unique identifier for an item in each datasource?
* Does each datasource include a way to cross-reference (link to) a corresponding item in the other datasource?
* What methods are available to search the contents of each datasource?
The answers to these questions may influence the reconciliation process and how it is best achieved.
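Some of these questions can be answered with a quick script once each datasource has been exported. The following is a rough sketch that assumes each export is a list of dictionaries; the function name `summarize` and the field names in the sample data are hypothetical.

```python
def summarize(records: list, id_field: str, xref_field: str) -> dict:
    """Report basic statistics for one exported datasource."""
    ids = [record[id_field] for record in records]
    return {
        # How many records (items) are included?
        "records": len(records),
        # Is the chosen field usable as a unique identifier?
        "ids_unique": len(set(ids)) == len(ids),
        # How many items already link to the other datasource?
        "with_cross_reference": sum(1 for r in records if r.get(xref_field)),
    }


# Hypothetical export with a 'wikidata_id' cross-reference column.
sample = [
    {"id": "tomato", "wikidata_id": "Q23501"},
    {"id": "black beans", "wikidata_id": None},
]
stats = summarize(sample, id_field="id", xref_field="wikidata_id")
```

Running the same summary against both datasources gives a first impression of how much manual work the reconciliation is likely to involve.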
Algorithm
- Begin by taking a snapshot (a read-only copy) of the origin datasource, so that when you are finished you can confirm that you have processed all of the items that it contained.
- For each item in the snapshot:
  - Extract the information required to search for the item in the destination datasource.
  - Search for the item in the destination.
  - Does the item exist?
    - Yes
      - Add a cross-reference from this item to the destination item.
      - Check for any metadata on the destination item that appears to be more up-to-date and/or accurate than this item, and where it seems safe and correct to do so, update this item with that higher-quality information.
      - Does the destination item have an existing cross-reference back to the origin datasource?
        - Yes
          - Is the cross-reference to this item?
            - Yes
              - Success: continue
            - No
              - Potential conflict.
        - No
          - Provide a suggestion to the destination datasource to add the missing cross-reference.
    - No
      - Provide a suggestion to the destination datasource to add the missing item.
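Although the process described here is manual, the decision tree can be sketched in code to make the branches explicit. This is a minimal sketch under several assumptions: both datasources are in-memory dictionaries keyed by item identifier, each item has a `name` and an optional `xref` field, and `find_in_destination` is a hypothetical exact-name search standing in for whatever search method the destination really offers. The metadata comparison step is left to a human reviewer, and the destination is never modified, only given suggestions.

```python
import copy


def find_in_destination(destination: dict, name: str):
    """Hypothetical search: exact, case-insensitive name match."""
    for dest_id, dest_item in destination.items():
        if dest_item["name"].lower() == name.lower():
            return dest_id
    return None


def reconcile(origin: dict, destination: dict) -> dict:
    """Walk every origin item and record the outcome of each branch."""
    snapshot = copy.deepcopy(origin)  # read-only copy of the origin
    report = {"linked": [], "suggest_xref": [], "suggest_item": [], "conflicts": []}
    for origin_id, item in snapshot.items():
        dest_id = find_in_destination(destination, item["name"])
        if dest_id is None:
            # Item does not exist: suggest adding it to the destination.
            report["suggest_item"].append(origin_id)
            continue
        # Item exists: add a cross-reference from origin to destination.
        origin[origin_id]["xref"] = dest_id
        back_ref = destination[dest_id].get("xref")
        if back_ref is None:
            report["suggest_xref"].append((dest_id, origin_id))
        elif back_ref == origin_id:
            report["linked"].append((origin_id, dest_id))
        else:
            # The destination already points at a different origin item.
            report["conflicts"].append((origin_id, dest_id, back_ref))
    return report


# Hypothetical sample data exercising all four outcomes.
origin = {
    "o1": {"name": "Tomato", "xref": None},
    "o2": {"name": "Lime", "xref": None},
    "o3": {"name": "Black beans", "xref": None},
    "o4": {"name": "Lemon", "xref": None},
}
destination = {
    "d1": {"name": "tomato", "xref": "o1"},
    "d2": {"name": "lime", "xref": "o4"},
    "d4": {"name": "lemon", "xref": None},
}
report = reconcile(origin, destination)
```

Comparing the number of entries across the four report lists with the size of the snapshot is one way to confirm that every origin item was processed.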
The most difficult scenario here is when a destination item already exists but refers to a different item in the origin datasource. This often indicates some kind of difference in representation of ingredients, or a different hierarchical ordering of those ingredients. It's difficult to prescribe an algorithm to resolve this automatically: some amount of common sense, investigation, research and question-asking is usually required to figure out a good solution. If in doubt, write down a note about the ambiguity, and after you have finished working through all of the origin items, take a look at the notes collected. There may be a pattern, and if so that may help you to ask better questions about how to resolve the inter-datasource ambiguity more comprehensively. Answers are not guaranteed: in some cases, further discussion, and potentially evolution of the relevant schemas, may be required.
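If the notes are kept in a structured form, looking for a pattern can be as simple as counting them by kind. This is only a sketch: the `kind` labels and the sample notes are hypothetical, loosely following the reasons for ambiguity listed earlier on this page.

```python
from collections import Counter


def summarize_notes(notes: list) -> list:
    """Count collected notes by kind, most frequent first."""
    return Counter(note["kind"] for note in notes).most_common()


# Hypothetical notes written down while working through the origin items.
notes = [
    {"origin": "lime", "kind": "naming"},
    {"origin": "citron", "kind": "naming"},
    {"origin": "cherry tomato", "kind": "granularity"},
]
summary = summarize_notes(notes)
```

A kind that dominates the summary may point to a question worth raising with the maintainers of the other datasource.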