Student projects/GSOC/Proposals

From Open Food Facts wiki

Revision as of 15:39, 5 February 2019

Open Food Facts has been selected as one of the mentor organizations for the 2018 Google Summer of Code and is a 2019 Google Summer of Code candidate.

Here are some ideas to help you build the strongest and most impactful proposals to submit for the Summer of Code program.

This page lists the key areas where we need the most help. You are of course very welcome to propose other project ideas, and we are looking forward to discussing these ideas and yours.

Students and Mentors Welcome!

We are looking for both students to work on projects and additional mentors to help them. If you would like to be a GSoC mentor, please contact @stephane or @teolemon on Slack. In particular, we are looking for more mentors in the fields of data science, computer vision, machine learning, and Android and iOS development, so that we can accept even more projects this year.

Please join us on our Slack and introduce yourself on the #summerofcode channel.

Building strong proposals

Project ideas will need to be turned into strong project proposals. Here are some guidelines on how to write strong proposals for the Google Summer of Code.

To make your proposals more relevant, please take some time to familiarize yourself with the Open Food Facts project, and how the database is crowdsourced:

  • Explore our web site (https://world.openfoodfacts.org), starting with the Discover and Contribute pages.
  • Install our Android or iOS mobile app, scan some food products, and add photos for a few products from your country
  • Create an account on the web site, look up the products that you added, and edit the product pages to fill in the data for ingredients, nutrition facts etc.
  • Join us on Slack, request an instant invite

To discuss ideas, please join us on our Slack:

Google Summer of Code 2019 Project ideas

New features for the Open Food Facts Android and iOS apps to drive mass adoption and mass contribution

Why it's important: most of the data in the Open Food Facts database comes from crowdsourcing through mobile apps: users scan barcodes of products and send us photos and data for missing products. We need to add features to our apps that bring a lot of value to users, so that we gain mass adoption, and powerful features that let users contribute photos and data as easily and quickly as possible.

Key features needed:

Offline mode

  • A small version of the database needs to be included in the app (at install, and then synced regularly)
    • All products, but only key data
  • When scanning products, key data should be shown instantly, even if there's no network
  • History of scanned products, and full data for these products should be saved locally on the device
  • Offline contribution
    • While offline (e.g. in a store with no network), users need to be able to scan and take photos for lots of products
    • Photos should be sent when network becomes available
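The offline contribution flow above can be sketched as a queue that holds photos until the network returns. This is a minimal illustration in Python rather than Android or iOS code; the class name `PendingUploads` and the `send` callback are assumptions, not part of any existing Open Food Facts app.

```python
from collections import deque
from typing import Callable, Deque, Dict


class PendingUploads:
    """Hypothetical in-memory queue of photos captured while offline."""

    def __init__(self) -> None:
        self._queue: Deque[Dict[str, str]] = deque()

    def add(self, barcode: str, photo_path: str) -> None:
        """Remember a photo taken for a scanned barcode."""
        self._queue.append({"barcode": barcode, "photo_path": photo_path})

    def __len__(self) -> int:
        return len(self._queue)

    def flush(self, send: Callable[[Dict[str, str]], bool]) -> int:
        """Send queued photos in order and stop at the first failure.

        `send` is an assumed callback that returns True on success.
        Returns the number of photos that were sent.
        """
        sent = 0
        while self._queue:
            if not send(self._queue[0]):
                break  # the network probably dropped again; keep the rest
            self._queue.popleft()
            sent += 1
        return sent
```

A real app would also need to persist the queue on the device so that photos survive a restart; that part is left out here.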

Drip editing (Contributing data to Open Food Facts by answering simple questions)

  • Every little helps. Drip editing means asking Open Food Facts users small questions about the product they are looking at, each of which should take a split second to answer. Put together, the answers help complete products more quickly, update existing products and ensure quality. This project is about introducing drip editing in either the Android or the iOS app, in collaboration with the backend team. This means creating a server that provides those questions to the clients (the official apps, 3rd party clients on the web or mobile) and applies the answers to the associated product. Extra care should be given to a flexible architecture that embeds as little logic as possible in the clients and makes the server take care of the heavy lifting.
  • Sample mockups at Drip Editing
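To make the "server does the heavy lifting" idea concrete, here is a minimal sketch of the server-side logic: pick one question for a product and apply the answer. The field names, question wording, and function names are all hypothetical and only serve as an illustration.

```python
from typing import Dict, Optional

# Hypothetical fields and question texts, for illustration only.
QUESTION_TEMPLATES = {
    "categories": "Which category does this product belong to?",
    "brands": "What is the brand of this product?",
    "labels": "Does this product carry any labels (e.g. organic)?",
}


def next_question(product: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Return one question about the first missing field, or None."""
    for field, text in QUESTION_TEMPLATES.items():
        if not product.get(field):
            return {"code": product["code"], "field": field, "question": text}
    return None


def apply_answer(product: Dict[str, str], field: str, answer: str) -> Dict[str, str]:
    """Return a copy of the product with the answer applied.

    Unknown fields and blank answers are ignored, so clients cannot
    write arbitrary data.
    """
    updated = dict(product)
    if field in QUESTION_TEMPLATES and answer.strip():
        updated[field] = answer.strip()
    return updated
```

With this split, a client only has to display the `question` text and send back the answer; which question to ask next is decided entirely on the server.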

Personalisation and recommendations

  • Users should be able to provide data about themselves (age, sex, weight etc.), their diet restrictions (e.g. allergens, vegan, religious) and their preferences (organic, no GMOs, no palm oil etc.)
  • This data needs to be stored locally on the device, and not sent to Open Food Facts or 3rd parties
  • Grade scanned products based on this data
  • Display product recommendations / alternatives that better match the user preferences
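One possible shape for on-device grading is sketched below. The `Preferences` structure, the three grade names, and the product field names are assumptions made for this example; a real implementation would need a richer model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Preferences:
    """Hypothetical user profile, kept on the device only."""
    avoid_allergens: Set[str] = field(default_factory=set)
    wanted_labels: Set[str] = field(default_factory=set)


def grade(product: Dict[str, List[str]], prefs: Preferences) -> str:
    """Grade a product against the user's restrictions and preferences."""
    allergens = set(product.get("allergens", []))
    labels = set(product.get("labels", []))
    # Restrictions are hard constraints: any overlap rules the product out.
    if allergens & prefs.avoid_allergens:
        return "avoid"
    # Preferences are soft: missing wanted labels only lower the grade.
    if prefs.wanted_labels and not prefs.wanted_labels <= labels:
        return "partial match"
    return "good match"
```

Because the function only needs the product data and a local `Preferences` object, nothing about the user has to leave the device.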

Computer vision

Why it's important: all product data comes from photos of the product and labels. Today most of this data is entered manually. In order to be able to scale, we need to extract more data from photos automatically.

Background: We currently only do basic OCR for ingredients. There is a lot of room for improvement.

Sample data set:

  • Slack channels: #ai-computervision #ai-machinelearning
  • Github AI / machine learning: openfoodfacts-ai

Improve OCR for ingredients

  • Current state: on the server side, we use Tesseract and Google Cloud Vision, but the results are far from perfect
  • Create golden test sets to measure accuracy of the current OCR and improvements
  • Train OCR models targeted for ingredients
  • Automatic cropping of ingredients lists with language recognition, from larger images
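A golden test set needs a metric. One common choice, used here as an assumption rather than as the project's agreed metric, is the character error rate: the edit distance between the OCR output and the hand-checked reference text, divided by the length of the reference.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance normalised by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)
```

Averaging this value over every image in the golden set gives a single number that can be compared before and after a change to the OCR pipeline.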

Bring the Nutrition Table extraction system to the next level

  • Sagar built an automatic recognition, cropping and extraction system of nutrition facts tables
  • The project would consist of exploring various techniques to massively increase the recognition rate (currently 10-20%)

Brands and labels detection

  • Brands, logos and labels extraction

The OFF images contain several logotypes and labels that have essential nutritional information (Bio, gluten-free, etc.). Currently, this information is manually inserted into our database. We need to create an extensible machine-learning tool that extracts and categorises most of the labels and logos.

The expected API tool must be ready for production and be able to include future labels and logos with minimal computational effort (we don’t want to re-train the complete model for every new logo). Ideally, the system should have a web interface so that contributors can help add new logos.

Mentor: Fernando Villanueva

Product recognition

  • Recognize the front of products using the camera of the mobile phone, without scanning barcodes
  • Ideally offline and on-device recognition (without having to take a picture and send it to a server)

Packaging recognition

  • Is it a can, a bottle, a cardboard pack etc.

Create public training sets and data sets for all problems listed above

  • Similar to the Flickr logo set: http://image.ntua.gr/iva/datasets/flickr_logos/
  • Create a set that computer vision students can use to test different algorithms
    • Specify what needs to be included in each set, size, data, representativeness, most useful data format etc.
    • Create the set using real Open Food Facts photos and images

Data science

Why it's important: our product database is growing rapidly (730k products, with 10k new products every month in early 2019), so we need automated ways to extract and validate data.

Background: Over the past year we have ramped up our efforts and processed 1.5 million images with OCR and general entity, barcode and QR-code recognition. The result is 1.5 million matching JSON files with bounding boxes.

  • Slack channels: #ai-machinelearning
  • Github AI / machine learning: openfoodfacts-ai

Automatically classify products

  • Detect field values from other field values or bag of words from the OCR
    • Categories
    • Brands (in some cases, a strong feature can be the barcode prefix)
    • Labels
  • When certain, detected values can be applied immediately
  • When less certain, we can ask users to confirm suggestions
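The two confidence levels above could be handled with a simple routing step, sketched below. The threshold values and the function name are assumptions; suitable thresholds would have to be measured against validated data.

```python
from typing import Dict, List, Tuple


def route_predictions(
    predictions: List[Tuple[str, float]],
    auto_apply_at: float = 0.95,  # assumed threshold, to be tuned
    suggest_at: float = 0.5,      # assumed threshold, to be tuned
) -> Dict[str, List[str]]:
    """Split (value, confidence) pairs into apply / ask_user / discard."""
    routed: Dict[str, List[str]] = {"apply": [], "ask_user": [], "discard": []}
    for value, confidence in predictions:
        if confidence >= auto_apply_at:
            routed["apply"].append(value)
        elif confidence >= suggest_at:
            routed["ask_user"].append(value)
        else:
            routed["discard"].append(value)
    return routed
```

Values routed to `ask_user` are natural candidates for the drip editing questions described earlier on this page.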

Automatically detect errors

  • Detect products that likely have bad nutrition facts
    • e.g. by looking at outliers among products of the same category, with a fallback for products that don't have any category.
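As a starting point, outliers can be flagged with the interquartile range (IQR) rule, comparing each product with others in its category and falling back to all products when the category is missing or too small. This is only a sketch: the field names (`code`, `category`, `sugars_100g`), the `min_group` size and the 1.5 × IQR factor are assumptions for the example.

```python
from statistics import quantiles
from typing import Dict, List, Tuple


def iqr_bounds(values: List[float], k: float = 1.5) -> Tuple[float, float]:
    """Lower and upper fences of the IQR rule."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def find_outliers(products: List[dict], nutrient: str = "sugars_100g",
                  min_group: int = 5) -> List[str]:
    """Return the codes of products whose nutrient value looks suspicious."""
    by_category: Dict[object, List[float]] = {}
    for p in products:
        if nutrient in p:
            by_category.setdefault(p.get("category"), []).append(p[nutrient])
    all_values = [v for values in by_category.values() for v in values]

    flagged = []
    for p in products:
        if nutrient not in p:
            continue
        group = by_category.get(p.get("category"), [])
        # Fall back to all products if the category is missing or too small.
        if p.get("category") is None or len(group) < min_group:
            group = all_values
        if len(group) < min_group:
            continue  # not enough data to judge
        low, high = iqr_bounds(group)
        if not low <= p[nutrient] <= high:
            flagged.append(p["code"])
    return flagged
```

Flagged products are only candidates for review, since a legitimate product can sit far from the rest of its category.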

Backend server and database

Backend Improvements, New Features and Performance

  • The Open Food Facts backend and web site are coded in Perl and the data is stored in MongoDB
  • We currently have performance issues with MongoDB requests that could be optimized
  • There are many other areas for improvement in the backend, see https://github.com/openfoodfacts/openfoodfacts-server


Other projects

Taxonomy Editor

  • Slack channels: #taxonomies #perl #python #php #nodejs
  • We define and use multilingual taxonomies for categories, labels, ingredients and other fields.
  • Those taxonomies are directed acyclic graphs (hierarchies where a child can have multiple parents).
  • They are currently defined in text files hosted on our wiki: https://en.wiki.openfoodfacts.org/Global_taxonomies but it is becoming unmanageable (the biggest taxonomy for categories is 37k lines long).
  • We need a tool that makes it easy to edit the taxonomy and translate it.
  • The code that handles taxonomies is written in Perl, but the taxonomy editor could be written in any language (e.g. python, nodejs, php etc.)
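Because a taxonomy entry can have several parents, an editor has to work with a graph rather than a tree. The sketch below collects all ancestors of an entry; the `parents` mapping and the entry names are made up for the example and do not reflect the actual taxonomy file format.

```python
from typing import Dict, List, Set


def ancestors(parents: Dict[str, List[str]], node: str) -> Set[str]:
    """Collect every ancestor of `node` in a child -> parents mapping."""
    seen: Set[str] = set()
    stack = list(parents.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:  # also guards against accidental cycles
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


# Hypothetical entries: "hazelnut spreads" has two parents.
EXAMPLE_PARENTS = {
    "hazelnut spreads": ["sweet spreads", "nut products"],
    "sweet spreads": ["spreads"],
    "nut products": ["plant-based foods"],
}
```

An editor could use the same traversal to reject an edit that would make an entry its own ancestor, which keeps the graph acyclic.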

Very ambitious projects

Next generation open source barcode reader

The leading open source barcode reader (ZXing) is in maintenance mode, and it is now lagging behind closed source solutions such as Scandit. While the quality of smartphone cameras and processing power has dramatically increased, the ability to quickly and accurately scan barcodes is still not adequate in many real life situations (e.g. scanning barcodes on food packages that are not completely flat).

A new open source barcode reader that uses today's cameras and processors would not only benefit Open Food Facts, but also thousands of other applications that currently use ZXing.

Potential features:

  • using machine learning instead of standard computer vision methods
  • continuous scanning in all directions (instead of having to align a red line with the barcode)
  • using OCR on the printed digits to enhance the accuracy of recognitions
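One way OCR on the printed digits can help, assuming EAN-13 barcodes, is through the check digit: the 13th digit is computed from the first 12, so a candidate reading can be rejected when the digits do not add up. The function names below are hypothetical; the weighting scheme (1, 3, 1, 3, ...) is the standard EAN-13 rule.

```python
from typing import Iterable, Optional


def ean13_check_digit(first12: str) -> int:
    """Compute the EAN-13 check digit from the first 12 digits."""
    if len(first12) != 12 or not (first12.isascii() and first12.isdigit()):
        raise ValueError("expected exactly 12 ASCII digits")
    # Digits are weighted 1, 3, 1, 3, ... from left to right.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(first12))
    return (10 - total % 10) % 10


def is_valid_ean13(code: str) -> bool:
    """True if `code` is 13 digits with a correct check digit."""
    if len(code) != 13 or not (code.isascii() and code.isdigit()):
        return False
    return ean13_check_digit(code[:12]) == int(code[12])


def pick_reading(candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate reading that passes the checksum."""
    for code in candidates:
        if is_valid_ean13(code):
            return code
    return None
```

For example, if the bar decoder and the digit OCR disagree, `pick_reading([decoded, ocr_digits])` keeps whichever reading passes the checksum, and returns `None` when neither does.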


2018-2017 Project ideas

Your ideas

Please feel free to submit proposals for other ideas that you have. Talk to us about them as early as possible, so that we can give you early feedback.

Thank you!