Ingredients Analysis Quality Evaluation - August 2023

From Open Food Facts wiki

Latest revision as of 10:43, 7 August 2024

Evaluation of the quality of Ingredients Extraction and Analysis so that we can measure the improvements.

Ingredients parsing and recognition

For each product, the ingredients list is parsed to separate each ingredient. Ingredients that we can match to our multilingual ingredients taxonomy are marked as "known"; all others are marked as "unknown".

There are different reasons an ingredient can be marked as unknown:

  • The input ingredient list is incorrect (misspellings, etc.)
  • The ingredients list contains things other than ingredients, and we have not been able to split it at the right place.
  • We have not been able to parse a particular sentence structure or formatting (e.g., "a drop of delicious honey with a shower of powder sugar")
  • The ingredient is not yet present in our taxonomy (or not with the right synonym or in the right language)
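
The reasons above can be illustrated with a minimal sketch. It is not the actual Open Food Facts parser: the tiny TAXONOMY dictionary, the comma/semicolon splitting rule, and the function names split_ingredients and classify are all assumptions made for illustration.

```python
import re

# Hypothetical miniature taxonomy (synonym -> canonical tag). The real
# taxonomy is multilingual and far larger; these entries are made up.
TAXONOMY = {
    "sugar": "en:sugar",
    "honey": "en:honey",
    "wheat flour": "en:wheat-flour",
    "salt": "en:salt",
}


def split_ingredients(text):
    """Split an ingredients list on commas and semicolons (a simplification)."""
    parts = re.split(r"[,;]", text)
    return [part.strip().lower() for part in parts if part.strip()]


def classify(text):
    """Return (ingredient, status) pairs, where status is 'known' or 'unknown'."""
    return [
        (name, "known" if name in TAXONOMY else "unknown")
        for name in split_ingredients(text)
    ]
```

With this sketch, classify("Wheat flour, sugar, slat") marks the misspelled "slat" as unknown, and the free-form sentence "a drop of delicious honey with a shower of powder sugar" stays in one piece and is marked unknown as a whole, even though it mentions two ingredients that are in the taxonomy.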

In the tables below:

  • The first column of numbers corresponds to the number of unique ingredients across all products.
  • The second column of numbers corresponds to the number of occurrences of those ingredients. For example, if a specific ingredient appears in 5 products, it is counted 5 times.
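
The two columns can be sketched as follows. This is an illustrative reconstruction, not the code that produces the statistics pages: the input shape (one list of (tag, is_known) pairs per product), the function names tag_stats and share, and the choice to count a tag once per product are assumptions.

```python
from collections import Counter


def tag_stats(products):
    """Compute (unique tags, occurrences) for known, unknown, and all tags.

    products: a list with one entry per product, each entry being a list
    of (tag, is_known) pairs.
    """
    occurrences = Counter()
    known = set()
    for ingredients in products:
        # Assumption: a tag is counted once per product that contains it.
        for tag, is_known in set(ingredients):
            occurrences[tag] += 1
            if is_known:
                known.add(tag)

    rows = {"known": [0, 0], "unknown": [0, 0]}
    for tag, count in occurrences.items():
        row = rows["known" if tag in known else "unknown"]
        row[0] += 1      # unique tags
        row[1] += count  # occurrences
    rows["all"] = [len(occurrences), sum(occurrences.values())]
    return {name: tuple(values) for name, values in rows.items()}


def share(part, total):
    """Format part/total as a percentage with two decimals."""
    return f"{100 * part / total:.2f}%"
```

For instance, share(726, 2822) gives "25.73%", which matches the known share of unique tags in the first JP table below.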


JP - Japanese

All products, 2023-07-30: 2822 原材料 (ingredients)

Type	Unique tags	Occurrences
known	726 (25.73%)	7191 (69.84%)
unknown	2096 (74.27%)	3105 (30.16%)
all	2822 (100.00%)	10296 (100.00%)

All products, 2022-07-31: 2593 原材料 (ingredients)

Type	Unique tags	Occurrences
known	766 (29.54%)	8466 (77.51%)
unknown	1827 (70.46%)	2456 (22.49%)
all	2593 (100.00%)	10922 (100.00%)

Top 10k most scanned products:


HR - Croatian

All products, 2023-07-30: 5700 sastojci (ingredients)

Type	Unique tags	Occurrences
known	1304 (22.88%)	25105 (83.02%)
unknown	4396 (77.12%)	5134 (16.98%)
all	5700 (100.00%)	30239 (100.00%)

All products, 2023-07-31: 5254 sastojci (ingredients)

Type	Unique tags	Occurrences
known	1197 (22.78%)	26194 (84.69%)
unknown	4057 (77.22%)	4735 (15.31%)
all	5254 (100.00%)	30929 (100.00%)

All products, 2023-09-20: 3420 sastojci (ingredients)

Type	Unique tags	Occurrences
known	1413 (41.32%)	29483 (92.97%)
unknown	2007 (58.68%)	2229 (7.03%)
all	3420 (100.00%)	31712 (100.00%)

Top 10k most scanned products:


PL - Polish

All products, 2022-08-16: 14121 składniki (ingredients)

Type	Unique tags	Occurrences
known	1918 (13.58%)	127042 (89.13%)
unknown	12203 (86.42%)	15488 (10.87%)
all	14121 (100.00%)	142530 (100.00%)

All products, 2023-08-15: 16160 składniki (ingredients)

Type	Unique tags	Occurrences
known	2048 (12.67%)	116070 (84.73%)
unknown	14112 (87.33%)	20910 (15.27%)
all	16160 (100.00%)	136980 (100.00%)

All products, 2023-12-01: 16478 składniki (ingredients)

Type	Unique tags	Occurrences
known	1957 (11.88%)	135344 (87.87%)
unknown	14521 (88.12%)	18689 (12.13%)
all	16478 (100.00%)	154033 (100.00%)

Top 10k most scanned products, 2022-08-16: 3627 składniki (ingredients)

Type	Unique tags	Occurrences
known	1290 (35.57%)	34703 (92.74%)
unknown	2337 (64.43%)	2718 (7.26%)
all	3627 (100.00%)	37421 (100.00%)

Top 10k most scanned products, 2023-08-15: 4406 składniki (ingredients)

Type	Unique tags	Occurrences
known	1296 (29.41%)	31605 (88.30%)
unknown	3110 (70.59%)	4189 (11.70%)
all	4406 (100.00%)	35794 (100.00%)

Top 10k most scanned products, 2023-12-01: 3809 składniki (ingredients)

Type	Unique tags	Occurrences
known	1309 (34.37%)	35540 (92.39%)
unknown	2500 (65.63%)	2926 (7.61%)
all	3809 (100.00%)	38466 (100.00%)

UK + English

All products, 2023-09-14: 60330 ingredients

Type	Unique tags	Occurrences
known	2840 (4.71%)	649282 (88.62%)
unknown	57490 (95.29%)	83403 (11.38%)
all	60330 (100.00%)	732685 (100.00%)

All products, 2023-12-01: 63758 ingredients

Type	Unique tags	Occurrences
known	2916 (4.57%)	723156 (89.04%)
unknown	60842 (95.43%)	89015 (10.96%)
all	63758 (100.00%)	812171 (100.00%)

Top 10k most scanned products, 2023-09-14: 7243 ingredients

Type	Unique tags	Occurrences
known	1728 (23.86%)	79991 (92.23%)
unknown	5515 (76.14%)	6740 (7.77%)
all	7243 (100.00%)	86731 (100.00%)

Top 10k most scanned products, 2023-12-01: 7223 ingredients

Type	Unique tags	Occurrences
known	1760 (24.37%)	83819 (92.64%)
unknown	5463 (75.63%)	6661 (7.36%)
all	7223 (100.00%)	90480 (100.00%)

Observations