* https://github.com/openfoodfacts/bio-codes
=Methodology to extract data=
==General==
===Get Data Files===
* All data (URL lists and data files) can be found at https://github.com/openfoodfacts/eu-food-data
**A general CSV file lists the link to the data repository for each country: https://github.com/openfoodfacts/eu-food-data/blob/master/list-eu-and-partner-countries.csv
* A Google Sheet is used to map all files available in the target countries. It also maps the section names of every country, in its own language or in English translation, to the related European section, which is used as a general taxonomy. The Google Sheet can be found here: https://docs.google.com/spreadsheets/d/1egdo58Ds8PNi5G_4F2UtWOWC1V0k3tXBgPhZXs5FRqM/edit?usp=sharing
*The data is usually available as txt or csv files, but for some countries it is only available as PDF. Those files need OCR processing before data extraction.
===Build CSV files===
*Several formats are used across EU countries, and each of them needs a specific approach. See the per-country details below.
===Geocode===
*Geocoding script + Google Maps
*List of local authorities in the UK: http://localweblist.net/
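The geocoding step is not detailed on this page. As a minimal sketch, assuming the script calls the Google Maps Geocoding API over HTTP, the request and response handling could look like the following. The function names and the API key parameter are hypothetical and are not taken from the project's scripts.

```python
import json
from urllib.parse import urlencode

# Public endpoint of the Google Maps Geocoding API (JSON output).
GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(address, api_key):
    """Return the request URL for one address lookup (hypothetical helper)."""
    return GEOCODE_ENDPOINT + "?" + urlencode({"address": address, "key": api_key})


def parse_geocode_response(body):
    """Extract (lat, lng) from a Geocoding API JSON body, or None if no match."""
    payload = json.loads(body)
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    # Keep only the first candidate returned by the API.
    location = payload["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]
```

The body passed to <code>parse_geocode_response</code> would typically come from <code>urllib.request.urlopen(build_geocode_url(address, api_key)).read()</code>. Keeping the parsing separate from the network call makes it possible to test the script on saved responses.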
==France==
*I just built a script that takes all the French agreement information from the Agriculture Ministry and concatenates it into one file. The next step is to do the same for the UK. The step after that is to cleverly aggregate the duplicates (some companies have several health agreements under the same agreement number).
https://github.com/openfoodfacts/eu-food-data/blob/master/scripts/FR-script.py
This script uses this file to get the list of URLs to retrieve: https://github.com/openfoodfacts/eu-food-data/blob/master/fr/urls-fr.txt
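The concatenation described above can be sketched as follows. This is not the content of FR-script.py. It assumes that every downloaded file is a CSV with the same header and that the delimiter is a semicolon; both are assumptions, and the function names are hypothetical.

```python
import csv
import io


def read_url_list(text):
    """Return the non-empty, non-comment lines of a URL list file."""
    return [line.strip() for line in text.splitlines()
            if line.strip() and not line.lstrip().startswith("#")]


def concatenate_csv(texts, delimiter=";"):
    """Merge CSV documents that share one header into a single list of rows.

    The header row is kept once; a file with a different header is rejected.
    The ';' delimiter is an assumption about the source files.
    """
    merged = []
    header = None
    for text in texts:
        rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
        if not rows:
            continue  # skip empty downloads
        if header is None:
            header = rows[0]
            merged.append(header)
        elif rows[0] != header:
            raise ValueError("header mismatch between input files")
        merged.extend(rows[1:])
    return merged
```

Each entry of <code>texts</code> would be the decoded body of one URL returned by <code>read_url_list</code>. The merged rows can then be written out with <code>csv.writer</code>.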
*First work performed on
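The duplicate aggregation planned for the French data (several health agreements under the same agreement number) could be sketched like this. The column names <code>agreement_number</code> and <code>activity</code> are hypothetical placeholders, not the real headers of the ministry files.

```python
def aggregate_by_agreement(rows, key="agreement_number", merge_field="activity"):
    """Collapse rows sharing the same agreement number into one entry.

    The first row seen for a number provides the company details; the values
    of `merge_field` from all of its rows are collected into a list, without
    repeats and in order of appearance. Column names are assumptions.
    """
    merged = {}
    for row in rows:
        number = row[key]
        if number not in merged:
            entry = dict(row)
            entry[merge_field] = [row[merge_field]]
            merged[number] = entry
        elif row[merge_field] not in merged[number][merge_field]:
            merged[number][merge_field].append(row[merge_field])
    return list(merged.values())
```

The input rows can come from <code>csv.DictReader</code>. Keeping the merged values in a list avoids losing information when one company holds several agreements.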
==UK==
*As the UK is divided into 4 regions (Northern Ireland, England, Wales and Scotland) and because they use different file formats, we use a 3-file script:
https://github.com/openfoodfacts/eu-food-data/blob/master/scripts/UK-urls.txt => all UK URLs
https://github.com/openfoodfacts/eu-food-data/blob/master/scripts/UK-methods.txt => lists which method to use depending on the file type
https://github.com/openfoodfacts/eu-food-data/blob/master/scripts/UK-script.py => the script itself
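The idea of picking a method per file can be sketched as a dispatch table. The actual layout of UK-methods.txt is not described here, so the sketch assumes one "URL method" pair per line, separated by whitespace. The method names and parser functions are hypothetical.

```python
import csv
import io


def parse_csv(text):
    """Parse comma-separated text into a list of rows."""
    return list(csv.reader(io.StringIO(text)))


def parse_tsv(text):
    """Parse tab-separated text into a list of rows."""
    return list(csv.reader(io.StringIO(text), delimiter="\t"))


# Hypothetical method names mapped to the function that handles them.
PARSERS = {"csv": parse_csv, "tsv": parse_tsv}


def load_methods(text):
    """Read 'url method' lines (assumed format) into a dict."""
    methods = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            url, method = parts
            methods[url] = method
    return methods


def parse_with_method(url, text, methods):
    """Parse the content of `url` with the method registered for it."""
    method = methods.get(url)
    if method not in PARSERS:
        raise KeyError(f"no parser for {url!r} (method={method!r})")
    return PARSERS[method](text)
```

Adding support for a new regional format then only requires a new parser function and a new entry in <code>PARSERS</code>.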
==DE==
*@vince has per
==Inspiration==
*http://free.sourcemap.com/