Graphing - en
Recently I created some graphs to analyse the product category Rices. I ran into several issues along the way that are worth discussing for others.
First of all, there is a large discrepancy between the simple search results for the category Rices and the extended search results for the same category. I have no explanation for this at the moment. As the number of results was large enough, I could proceed.
The first question one has to ask is which dimensions will be used for the graph. I wanted to explore the diversity of the category Rices. The statistics revealed that the nutritional values for fiber and proteins showed the largest standard deviations, which suggests that the points will spread out nicely in the graph.
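As a minimal sketch of that selection step, the snippet below ranks nutrients by their standard deviation and takes the top two as graph axes. The product values and the field names (`fiber`, `proteins`, `fat`) are made up for illustration and are not taken from the actual Rices data.

```python
from statistics import pstdev

# Hypothetical sample: nutrient values per 100 g for a few products.
products = [
    {"fiber": 0.5, "proteins": 7.0, "fat": 0.6},
    {"fiber": 3.5, "proteins": 7.5, "fat": 0.9},
    {"fiber": 1.0, "proteins": 9.0, "fat": 0.7},
    {"fiber": 2.2, "proteins": 6.5, "fat": 0.8},
]

def rank_by_spread(rows):
    """Return nutrient names sorted by standard deviation, largest first."""
    nutrients = rows[0].keys()
    spread = {
        name: pstdev([row[name] for row in rows if row.get(name) is not None])
        for name in nutrients
    }
    return sorted(spread, key=spread.get, reverse=True)

ranking = rank_by_spread(products)
x_axis, y_axis = ranking[:2]  # the two most spread-out nutrients
print(x_axis, y_axis)
```

Note that a raw standard deviation depends on the scale of each nutrient, so this ranking only compares like with like when all values share the same unit (here, grams per 100 g).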
And indeed, this resulted in a nice graph. But what were all those outliers? It turns out that a lot of products were assigned to the category Rices but, for my purpose, were not rices.
So how is the category Rices defined? No idea. Open Food Facts does not offer a definition of a category. There are no rules deciding when to assign a product to a certain category. And no quality checks to enforce those rules.
So I decided to clean up a bit (only the outliers), based on a definition I invented. The definition goes like this.
A product belongs to the category Rices if its ingredients are all rices as well.
This implies that a lot of outliers could be removed. They all had other ingredients: rice meals for the microwave; mixes of cereals, lentils and rice; rices with added aromas; etc. Looking at the categories, it seems that people rely on the ingredients to define a category.
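The definition above can be sketched as a small filter. This is only an illustration under stated assumptions: the product names and ingredient lists are invented, and the test "the last word of the ingredient is rice" is a crude stand-in for a real ingredient taxonomy.

```python
def is_rice(ingredient):
    """Crude check (assumption): the ingredient name ends with the word 'rice'."""
    words = ingredient.lower().split()
    return bool(words) and words[-1] == "rice"

def belongs_to_rices(ingredients):
    """A product qualifies only if it lists ingredients and all of them are rices."""
    return bool(ingredients) and all(is_rice(item) for item in ingredients)

# Hypothetical products, mirroring the kinds of outliers described above.
samples = {
    "basmati rice": ["basmati rice"],
    "microwave rice meal": ["rice", "vegetables", "sunflower oil"],
    "cereal mix": ["rice", "lentils", "quinoa"],
    "flavoured rice": ["long grain rice", "natural flavouring"],
    "no ingredients listed": [],
}

kept = [name for name, ingredients in samples.items() if belongs_to_rices(ingredients)]
print(kept)
```

Products without an ingredient list are rejected here because the definition cannot be verified for them; whether to keep or drop those is a judgment call.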
There were still some outliers left, but those really were rices. I would like to know what Uncle Ben does to its rice to make it an outlier. Uncle Ben is also a fan of giving nutritional values for the prepared product.
I did not assess the structure under Rices, but the same questions arise. What is the definition? Do the whole rices also have whole rice as their ingredient? Are white rices all the rices that are polished, covering all rice types? Etc.
And maybe there is a relation with the Nova groups as well. The category Rices is probably entirely Nova group 1, which would also exclude products with ingredients outside Nova group 1.
So when graphing, check the outliers and repair them before you publish.
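I spotted my outliers by eye in the graph, but the check can also be roughed out in code. The sketch below uses the common 1.5 × IQR rule, which is my assumption here and not something Open Food Facts prescribes; the product names and fiber values are invented.

```python
from statistics import quantiles

# Hypothetical fiber values (g per 100 g) for products in one category.
fiber = {
    "white rice A": 0.5,
    "white rice B": 0.8,
    "basmati rice": 1.0,
    "jasmine rice": 1.2,
    "parboiled rice": 1.4,
    "long grain rice": 1.5,
    "whole rice": 2.0,
    "rice and lentil mix": 9.0,
}

def flag_outliers(values, factor=1.5):
    """Return the names whose value lies outside the IQR fences."""
    q1, _, q3 = quantiles(values.values(), n=4)
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [name for name, value in values.items() if value < low or value > high]

flagged = flag_outliers(fiber)
print(flagged)
```

A flagged product is only a candidate for review: it may be a wrongly categorised product, or a genuine rice with unusual values.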