Olive oil - en

From Open Food Facts wiki

Introduction

In November 2020 @stephane asked me to have a look at the products in the olive oil category. The category was in need of cleaning up: an olive oil should have a Nutriscore C, but other values were seen. So I had a look at the products and cleaned up a bit. This post is a log of my observations.

Background

Olive oils are a special case in the Nutriscore calculation. Olive oil is seen as one of the better oils, and thus may get a better score. This exception was recently introduced (I cannot find the change history).

However, all oils score badly, so olive oil will never get a better score than C. This has provoked a lot of resistance in Italy and Spain, as olive oil is seen as a major export product (see). Those countries might leave out olive oil from the Nutriscore obligation.

This is also a good reason to clean up the category.

Definition

Olive oil (https://en.wikipedia.org/wiki/Olive_oil) is a simple product: it has only one ingredient, olives. The extraction process and origin might influence the actual composition of oils. This might be visible in the nutritional values and labelling on a product.

Completeness

On 13 Dec 2020 the olive oils category (https://world.openfoodfacts.org/category/olive-oils) comprised 6118 products. There might be other olive oil products, which are not yet categorised.

Robotoff might help in finding these, as it might have found some eligible products. On 13 Dec 2020 there were no remaining questions on Robotoff.

There are many products for which we have no nutritional data. It is impossible to go through all these by hand. A quick look at these products reveals that some 10% have images of the nutritional tables from which the data can be extracted. Clearly there is a role for Robotoff to play here in automatically extracting the data, or just signalling us that there might be such data.

Interlopers

Are all the products in the Olive oils category indeed olive oils? We need to find the products that are not olive oils (the interlopers).

Number of ingredients

A first check is to look at the number of ingredients. It turns out that there are some products that seem wrongly classified.

 

Note that of the 6118 products only 1967 seem to have ingredients defined.

Not sure how I can list the products with more than 1 ingredient.
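One possible way to get such a list is to work on a CSV export of the category instead of the web interface. The sketch below is only an illustration: it assumes the export has `code` and `ingredients_text` columns and that ingredients are separated by commas or semicolons. The sample rows are made up.

```python
import csv
import io
import re

def count_ingredients(ingredients_text):
    """Rough ingredient count: split on commas and semicolons, ignore empties."""
    parts = re.split(r"[,;]", ingredients_text or "")
    return sum(1 for part in parts if part.strip())

def multi_ingredient_products(rows):
    """Return the barcodes of products listing more than one ingredient."""
    return [
        row["code"]
        for row in rows
        if count_ingredients(row.get("ingredients_text")) > 1
    ]

# Made-up sample standing in for a CSV export of the category.
SAMPLE = """code,ingredients_text
0001,extra virgin olive oil
0002,"olive oil, garlic flavouring"
0003,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
suspects = multi_ingredient_products(rows)
print(suspects)
```

A count based on separators is crude (an ingredient such as "olive oil (refined, virgin)" would be counted twice), so the result is a list of candidates to check by hand, not a verdict.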

Another way to find interlopers is by looking for strange nutritional values, but those can also be due to wrongly entered data.

Data quality

Nutritional values check

As a next step, to correct wrong nutritional data, I started by looking at the fat percentage. There are 6 products with more than 100% fat, so let's correct those first. It is easier to list those with a query and correct them. For these products the wrong data had been added. One product had a fat content of 101g though.

 

The next edit round looks at fat percentages lower than 88%.
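These two edit rounds boil down to a simple range filter. A minimal sketch, assuming the products are available as dictionaries with a `fat_100g` field (a field name assumed here, with made-up sample values):

```python
def fat_outliers(products, low=88.0, high=100.0):
    """Return (too_low, too_high) barcode lists for fat in g per 100 g."""
    too_low, too_high = [], []
    for product in products:
        fat = product.get("fat_100g")
        if fat is None:
            continue  # no nutritional data, nothing to check
        if fat > high:
            too_high.append(product["code"])
        elif fat < low:
            too_low.append(product["code"])
    return too_low, too_high

# Made-up sample data for illustration only.
products = [
    {"code": "0001", "fat_100g": 100.0},
    {"code": "0002", "fat_100g": 101.0},
    {"code": "0003", "fat_100g": 14.0},  # looks like a per-serving value
    {"code": "0004", "fat_100g": None},
    {"code": "0005", "fat_100g": 91.0},
]

too_low, too_high = fat_outliers(products)
print(too_low, too_high)
```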

The final result of the fat percentage distribution. Note that the distribution has two peaks, one around 100% and one around 90%. It seems that some producers assume that their olive oils consist of 100% fat and others really have it measured.

Now that I have done the fat percentage, I realise that I have to go through all nutritional values and look at values that are out of the ordinary.

Other checks

The nutritional values check also allowed me to remove wrongly assigned products (very few).

I also added ingredients images, did the recognition and added subcategories.

Typical errors

I repaired the following errors:

  • wrong classifications
  • mixup between per serving and per 100g data
  • all nutritional values set to 0
  • forgotten to check per serving
  • no total fat percentage on label
  • some yuka users converted the per serving to per 100g values

Olive oil conclusions and thoughts

States overview

After having repaired most of the olive oil products, we can make an overview of the states.

Some highlights (status on 16 dec 2020):

  • Total in category: 6043
  • Has nutritional values: 5380 (89%)
  • Has ingredients: 1971 (33%)

Ingredients

What are the ingredients? Now we mostly see olive oil written, or something like that. But does that convey anything? Preferably we would like to know the olive variety (or varieties), the origin of the olives and the processes applied to extract the oil. Words like virgin, extra virgin, cold-pressed, AOP, the origin of the olives, etc. provide more information and are good to have.

Serving size

The products sold in the US note a serving size of 15 ml on their packaging, which is equivalent to 1 tablespoon. Instead of the 15 ml, a weight of 14 g can be given. This weight roughly corresponds to the specific gravity of 0.911 of olive oils (wikipedia). On products sold in Europe a serving size of 10 ml can sometimes be found. The usefulness of having a serving size for a cooking aid can be debated. In fact 15 ml seems a lot.

Nutritional values

Energy

Many products use what seem to be standard canonical values for nutritional values. The canonical values for energy (wikipedia) are 3700 kJ or 880 kcal. These products use 900 kcal instead. In total there are 1957 products with this value.

The distribution shows this effect even more clearly:

 
Energy (kcal) distribution for olive oils (16 dec 2020)

As expected the same double peak is also seen in the distribution of energy in kJ.

Fat

The distribution of the fat percentage also shows a double peak:

 
Fat distribution of olive oils (16 dec 2020)

This seems stranger. You would expect that an oil consists of 100% fat, or maybe a bit less if there are impurities. The other peak lies around 91%, very much like the specific gravity of 0.911.

The conclusion is that some producers report their nutritional values per 100g, some per 100ml and some use the canonical value of 100% (2193 products). A few products have values close to 100%, which seems most honest.

If we have a look at the 100ml sample:

 
Distribution of fat percentage smaller than 97% (16-dec-2020)

Interestingly we have another distribution with two peaks. So where does the upper peak come from? Looking at the origin of the data, it looks like these products have nutritional values listed per serving. The canonical fat per serving is 14g for 15ml, which calculates to a "specific gravity" of 0.933, corresponding to this second peak. This larger value is due to rounding errors.
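The rounding effect can be written out as a small calculation. This is just the arithmetic from the paragraphs above, using the specific gravity of 0.911 and the 15 ml serving; the variable names are mine.

```python
SPECIFIC_GRAVITY = 0.911  # olive oil, as quoted above
SERVING_ML = 15           # one tablespoon on the label

# Exact weight of one serving, before the label rounds it.
exact_g = SERVING_ML * SPECIFIC_GRAVITY

# Labels show whole grams, so the serving weight is rounded.
label_g = round(exact_g)

# "Specific gravity" implied by the rounded label value.
apparent_gravity = label_g / SERVING_ML

print(exact_g, label_g, round(apparent_gravity, 3))
```

So a product that is 100% fat ends up near 93.3 g of fat per 100 ml when the per serving value is scaled up, instead of the 91.1 g you would get from the real specific gravity.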

Saturated Fat

The distribution of the saturated fat percentages does not present obvious multiple peaks (although the distribution does not seem symmetrical):

 
Distribution of saturated fat percentages (16-dec-2020)

Unsaturated fats

There are quite a few products (1252) that report their unsaturated fat content (mono- and poly-unsaturated fat). The ratio between these two unsaturated fats differs between products.

 
Correlation of the two unsaturated fats

You see that if there are more mono-unsaturated fats, there are fewer poly-unsaturated fats. There seem to be two correlation lines. I wonder whether these are due to a difference between the aforementioned groups.

The omegas

A few products report the Omega-3, -6 and -9 values. As this is always a standard(?) fraction of the unsaturated fats, it does not add much value at the moment.

Vitamins

Quite often Vitamin E and Vitamin A are indicated. Quite strange, as if you are going to take olive oil for your vitamins. I guess the producers want to improve the health standing of their products.

Conclusion

In conclusion we have three different groups of products, whose data we cannot compare. So we need to normalise before we can analyse any further.

We can define the standard nutritional value ranges for olive oils:

Nutritional element         10% percentile   Mean    90% percentile
Energy (kJ)                 3090             3550    4010
Energy (kcal)               750              862     970
Fat (g)                     91               95.5    100
Saturated fat (g)           13               13.9    16
Mono-unsaturated fat (g)    67               69.7    78
Poly-unsaturated fat (g)    6.6              9.7     13
Vitamin E (mg)              20               25      40
Vitamin A (µg)              200              223     330
Carbohydrate                0                -       0.5
Sugars                      0                -       0.5
Proteins                    0                -       0.5
Fiber                       0                -       0.5
Sodium                      0                -       0.5

Subcategories

OFF has defined multiple subcategories based on origin (country) or listed quality (virgin, extra virgin).

The virgin olive oils use a first pressing of olives. The second extraction is based on what is left over, and the result is called pomace olive oil (wikipedia). OFF seems to have used refined olive oils as a label for this category. I am not sure whether this refined category is the same as the pomace category; we should check the labels of the products.

Possible origins are now France (30 products), Greece (43 products) and Italy (49 products). We should add at least Tunisia, Morocco, Algeria, Spain, Argentina and South Africa to these.

In addition the countries can be subdivided into regions, which correspond to official PDOs. We can slowly add the PDOs as they appear in OFF. Only 72 olive oils have a PDO label. There are a lot of labels missing, as there are more products in the combined PDO categories.

It is unclear whether these subcategories exhibit also differences in nutritional values. Before we can determine that we need more categorisations.

New categories

By looking more closely at all the products we can identify other categories. Only if there are a lot of products in a category, or if the nutritional values deviate too much, is it worthwhile to create these new categories:

  • pure olive oils: the current olive oils category should be renamed to pure olive oils to indicate that these products contain only one ingredient. This helps also to distinguish from the other olive oils.
  • olive oil sprays: these contain other ingredients to make the oil sprayable. As the serving is only 0.25g, the nutritional values per serving are all zero (thanks to rounding).
  • enhanced olive oils: these olive oils have added vitamins for children(?)
  • flavoured olive oils: these olive oils have added flavours (garlic, etc.)
  • Olive oil blends: some oils are a blend of refined and virgin olive oils. Or a blend of virgin or extra virgin oils.
  • Unfiltered olive oils: some oils show that they are unfiltered. This might be another category.

NOVA

What should be the value of the NOVA-score for olive oils? It seems to be mainly NOVA 2 now. But shouldn't it be NOVA 3 for non-virgin olive oils? The pomace oils are created through chemical processes, so a NOVA downgrade seems appropriate. This would also mean that the standard value for the category Olive oils would be NOVA 3; only if the product were assigned to Virgin olive oils would it turn NOVA 2.

Nutriscore

The Nutri-score for olive oils should be all the same: 10 points for energy, 1 point for saturated fat to fat ratio and -5 points for being olive oils. This will calculate to 6 points and thus NutriScore C.
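As a quick check of this sum, here is a small sketch. The three point values are the ones given above; the grade boundaries in `nutriscore_grade` (A up to -1, B 0 to 2, C 3 to 10, D 11 to 18, E from 19) are an assumption on my side and should be verified against the official Nutri-Score rules.

```python
def nutriscore_grade(points):
    """Map a point total to a letter, using ASSUMED grade boundaries."""
    if points <= -1:
        return "A"
    if points <= 2:
        return "B"
    if points <= 10:
        return "C"
    if points <= 18:
        return "D"
    return "E"

# Point values as listed in the text above.
energy_points = 10
saturated_ratio_points = 1
olive_oil_points = -5

total = energy_points + saturated_ratio_points + olive_oil_points
grade = nutriscore_grade(total)
print(total, grade)
```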

Eco-score

The environmental score for extra virgin olive oil is 0.6, which is comparable to the other oils. No additional subdivisions are available. This implies that the Eco-score grade will be C.

Normalisation

As shown in the previous section the Olive Oils category can be split into 3 subgroups, based on how the nutritional values are reported:

  • per 100g
  • per 100ml
  • per volumetric serving

And maybe there is even a fourth group: per weight serving.

If we know how the nutritional data for a product is reported, we can normalise that data. And with the normalised data we have a consistent dataset, which can be used to get the real nutritional values.

Categorisation

The first step is to categorise each product into one of the groups.

Volumetric serving

The per volumetric serving group is the easiest. We need to know the type of nutritional table that has been used to extract the data from. Unfortunately this is not registered, so we need to find a proxy for this.

We could use the serving size field for this: if it contains ml, the data might have been taken from a per serving nutritional table. I have seen exceptions however. Unfortunately the web interface does not allow me to search on that.

A better proxy is the existence of the transfat field. If there is data in that field, it is most likely a US or Canadian style nutritional table. Again this is not a guarantee, as other countries mark transfat as well. This results in 724 products. However some 60 products of this transfat sample have fat percentages around 100%, so we have false positives. We can add a fat limit of 94%, resulting in 629 products.

The US/Canadian style nutritional tables do not show the energy in kJ. Thus if there is a value in that field, the table is not US/Canadian. This results in 746 products. This is a low number, indicating that the field has not been filled in very well.

Per 100 ml

Defining this group of products is more difficult. We have only the fat percentage itself as indication. As the fat percentage can not be larger than 100%, we can use the specific gravity (91.1%) as limit. This assumes all oils have the same specific gravity.

Looking at some products, the canonical rounded value is 91%, but higher values are seen as well. So we could take a limit of 92%. This results in 1475 products.

This will lead to some false positives. So it is better to first remove the US products and then extract these groups.
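The grouping rules described in this section can be sketched as a small function. This is an illustration of the proxies above, not the procedure I actually used: the field names `fat_100g` and `trans-fat_100g` are assumptions about the export, the kJ check is left out, and the sample values are made up.

```python
def classify(product):
    """Guess how the nutritional values of a product were reported."""
    fat = product.get("fat_100g")
    if fat is None:
        return "unknown"
    # Proxy for a US/Canadian table: the transfat field is filled in.
    has_transfat = product.get("trans-fat_100g") is not None
    # Take out the per-serving products first (fat limit of 94%).
    if has_transfat and fat < 94:
        return "serving"
    # What remains below the 92% limit is assumed to be per 100 ml.
    if fat <= 92:
        return "100ml"
    return "100g"

print(classify({"fat_100g": 93.3, "trans-fat_100g": 0.0}))
print(classify({"fat_100g": 91.0}))
print(classify({"fat_100g": 100.0, "trans-fat_100g": 0.0}))
```

A per-serving product without a transfat value would still end up in the wrong group, so the false positives mentioned above do not disappear completely.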

Normalisation

We need to convert the volumetric data to weight data. For this we can use the correction factors, which we determined earlier:

  • per 100 ml: divide by 0.911
  • per 15 ml: divide by 0.933
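Applied to a single value, the correction looks like the sketch below. The factors are the two listed above plus a factor of 1 for data that is already per 100 g; the group names are just labels I chose for the illustration.

```python
# Correction factors per group: value per 100 g = reported value / factor.
FACTORS = {
    "100g": 1.0,       # already per weight, nothing to do
    "100ml": 0.911,    # specific gravity of olive oil
    "serving": 0.933,  # 14 g per 15 ml, as printed on the labels
}

def normalise(value, group):
    """Convert a reported value to a per 100 g value."""
    return value / FACTORS[group]

print(round(normalise(91.0, "100ml"), 1))
print(round(normalise(93.3, "serving"), 1))
print(round(normalise(100.0, "100g"), 1))
```

The same division applies to every nutritional value of a product (fat, saturated fat, energy), not only to the fat percentage.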

Calculations

In order to get correct values, the data was exported in CSV format, imported in Numbers, pruned (all non-relevant data removed), scaled by the normalisation factors, merged into one table, and finally plotted and averaged.

Only the specific gravity needed to be adapted in order to get only a few fat percentages below 100%.

Results

Grouping results

For making the three groups

This resulted in group sizes of:

  • for 100g : 94 products
  • for 100ml: 235 products
  • for serving: 636 products

Distributions

We need to plot the same distributions as presented earlier, so that we can see if the corrections had any effect.

Fat
 
Distribution of normalised fat percentage (17-dec-2020)

The normalisation has worked well. All three distributions have their maximum at 100%. The largest group of 100ml has a tail below the 100%. This might be due to filtration or different specific gravities. I need to redo the graph: the x-axis label is wrong.

Saturated fat

The distribution spans some 4%. Normalisation seems to have worked well.

 
Distribution of normalised saturated fat percentage (17-dec-2020)

Mono-unsaturated fat

The range of mono-unsaturated fat spans some 10%. It seems there are two peaks visible in the three groups.

 
Distribution of normalised mono-unsaturated fat percentage (17-dec-2020)

Energy (kcal)
 
Distribution of normalised Energy (kcal) (17-dec-2020)

The three peaks seem to indicate that there are three groups. The peak seen around 857 kcal seems a bit off. We would expect it to be around 900 kcal. Products in this peak belong to the 100g group and the per serving group. It looks like the normalisation (which was based on the fats) did not work out well. So what happened?

The 100g-group products seem to come from the USDA import, but the data was per 100g and not per serving as you would expect for this import. So the normalisation was already done, but wrong.

The canonical energy value per serving is 120 kcal, and the canonical energy value for the 100g-group is 900 kcal, i.e. 7.5 times larger. Where could this factor of 7.5 come from? The conversion to 100 ml plus the conversion to gram only provides a factor of 7.14. Hence the value of 857 kcal, which we see in the distribution.

I wonder whether this difference is due to rounding issues. The US style nutritional table likes to have a simple serving size: 1 tablespoon, which translates to 15ml. But are tablespoons in the US measured in milliliters? It says so on the packages. If we calculate the tablespoon size from 120 and 900 kcal, we get a size of 14.63 ml, i.e. rounded 15 ml. We had better use this value for normalisation.

Or 123 kcal per 15 ml would have been more accurate.
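The numbers in this energy puzzle can be retraced with a few lines of arithmetic. All inputs come from the text above (120 kcal per serving, 900 kcal per 100 g, a 15 ml or 14 g serving, specific gravity 0.911); only the variable names are mine.

```python
KCAL_PER_SERVING = 120
KCAL_PER_100G = 900
SERVING_ML = 15
SERVING_G = 14
SPECIFIC_GRAVITY = 0.911

# Factor between the two canonical label values.
factor_seen = KCAL_PER_100G / KCAL_PER_SERVING

# Factor obtained by scaling the labelled 14 g serving to 100 g.
factor_from_label = 100 / SERVING_G

# Energy per 100 g that this scaling gives: the odd peak.
kcal_from_label = KCAL_PER_SERVING * factor_from_label

# Serving size implied by 120 kcal out of 900 kcal per 100 g.
implied_serving_g = 100 / factor_seen
implied_serving_ml = implied_serving_g / SPECIFIC_GRAVITY

# Energy in a true 15 ml serving if 100 g holds 900 kcal.
kcal_per_15ml = KCAL_PER_100G * SPECIFIC_GRAVITY * SERVING_ML / 100

print(factor_seen, round(factor_from_label, 2), round(kcal_from_label))
print(round(implied_serving_ml, 1), round(kcal_per_15ml))
```

The implied serving comes out at about 14.6 ml, so the 120 kcal on the label is consistent with 900 kcal per 100 g only if the tablespoon is slightly smaller than 15 ml.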

Correlations

  • Energy versus fat

The US/Canadian nutritional tables often have an entry Energy from Fat. So we would expect a correlation between energy and fat percentage. This is shown in the graph below. Both the fat percentage and the Energy (kcal) have been normalised.

 
Distribution of Fat percentage per Energy (kcal) (17-dec-2020)

The graph shows the 100g group as blue dots, the per serving group as green dots and the 100ml group as red dots.

The distribution of red dots seems to indicate a trend. This is however an optical effect. The corresponding trend lines are shown as lines. There is hardly any correlation between fat percentage and energy. The trend lines lie below the 100% fat percentage, due to a number of products having values below 100%.

The difference in fat percentage can be explained by differences in specific gravity.

  • Energy versus saturated fat
 
Correlation between energy and saturated fat percentage (17-dec-2020)

There is also no correlation between the saturated fat content and the energy.

  • Mono-unsaturated versus poly-unsaturated fat
 
Correlation between normalised poly- and mono-unsaturated fat percentage (18-dec-2020)

As expected there is a good correlation between the mono- and poly-unsaturated fat percentages. The normalisation worked out well.

  • Fat versus fats combined
 
Correlation between normalised fat percentages and saturated- /unsaturated fats combined (18-dec-2020)

You would expect that the combined saturated- and unsaturated-fats add up to the same as the fat percentage. This is tested in the correlation above. There seems to be an upward correlation. The added values can be some 10 above or below the total fat percentage. Is this an indication of the accuracy of the listed fat percentages?
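This consistency test is easy to express per product. A minimal sketch, assuming field names like `saturated-fat_100g`, `monounsaturated-fat_100g` and `polyunsaturated-fat_100g` in the export; the sample product simply reuses round numbers for illustration.

```python
FAT_PARTS = (
    "saturated-fat_100g",
    "monounsaturated-fat_100g",
    "polyunsaturated-fat_100g",
)

def fat_gap(product):
    """Sum of the listed fat fractions minus the total fat, or None."""
    total = product.get("fat_100g")
    parts = [product.get(name) for name in FAT_PARTS]
    if total is None or any(part is None for part in parts):
        return None  # cannot check an incomplete product
    return sum(parts) - total

# Made-up product for illustration.
product = {
    "fat_100g": 99.6,
    "saturated-fat_100g": 14.5,
    "monounsaturated-fat_100g": 72.9,
    "polyunsaturated-fat_100g": 11.2,
}
gap = fat_gap(product)
print(round(gap, 1))
```

A gap far from zero in either direction points to a data entry problem or to a label that is not internally consistent.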

Statistics

We can redefine the nutritional values based on the normalised data:

Nutritional element         10% percentile   Mean    90% percentile
Energy (kcal)               857              888     900
Fat (g)                     98.9             99.6    100
Saturated fat (g)           13.7             14.5    16.3
Mono-unsaturated fat (g)    71.4             72.9    78.5
Poly-unsaturated fat (g)    7.1              11.2    14.3
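I made this table in a spreadsheet, but the same three statistics can be computed with the Python standard library. The sketch below uses `statistics.quantiles`, whose default interpolation method may differ slightly from what a spreadsheet does, and the input values are made up.

```python
import statistics

def summarise(values):
    """Return (10% percentile, mean, 90% percentile) of a list of numbers."""
    deciles = statistics.quantiles(values, n=10)  # 9 cut points
    return deciles[0], statistics.mean(values), deciles[-1]

# Made-up values standing in for one normalised column.
values = list(range(1, 100))
low, mean, high = summarise(values)
print(low, mean, high)
```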

OFF Conclusions and thoughts

  • OFF quality

During the cleanup I edited a lot of products. This gives an indication of the quality of the OFF data. The counter stands at 318 products of the 6115 products in the olive oils category, i.e. about 5%.

  • Quality indicators

Many olive oils sold in the USA show extra quality indicators (eg this one). We could add these parameters to the nutritional values. These are mainly expressed as limits (less than or more than).

  • US import issues?

Many US products are not indicated by serving. In the total fat field, the monounsaturated fats are shown. Did we have an import issue?

  • Quality check

Can the verified average values be used as a quality check on new products? A flag could be raised if one of the nutritional values is outside the allowed range.
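Such a flag could look like the sketch below. It takes the 10% and 90% percentiles of the normalised table as the allowed range, which is my assumption of what a reasonable range would be, and it uses assumed field names. It only makes sense for values expressed per 100 g.

```python
# Allowed ranges per 100 g, taken from the normalised statistics table.
ALLOWED = {
    "energy-kcal_100g": (857, 900),
    "fat_100g": (98.9, 100),
    "saturated-fat_100g": (13.7, 16.3),
    "monounsaturated-fat_100g": (71.4, 78.5),
    "polyunsaturated-fat_100g": (7.1, 14.3),
}

def out_of_range(product):
    """Return the names of the fields whose value is outside the range."""
    flagged = []
    for field, (low, high) in ALLOWED.items():
        value = product.get(field)
        if value is not None and not low <= value <= high:
            flagged.append(field)
    return flagged

# Made-up product: the fat value looks like a per 100 ml value.
flags = out_of_range({"energy-kcal_100g": 900, "fat_100g": 91.0})
print(flags)
```

By construction about 20% of the correct products fall outside a 10% to 90% range, so in practice the limits would have to be widened before raising a flag.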

  • Missing Ingredients

Many products have no Nova calculated as their ingredient list is empty. Could we default to olive oil as ingredient? We could define this category as a base food with a single ingredient. This in turn could be used as quality check.

  • Wrong nutritional values

Several products present nutritional values that are clearly wrong. For now I edited these out. I would rather leave them and raise a flag, so that they are not used in the calculations.

  • Nutritional data per volume or weight

It should be possible to indicate whether the nutritional data on the package is given per volume or per weight. This would allow for normalisation afterwards based on specific gravity.

  • Category names

There are some category names that are singular, which must be changed to plural (mostly oil versus oils).

Should the origin be written as country or adjective? i.e. olive oils from France versus French olive oils.

Should a PDO be part of the category name? It would make them easier to recognise.