Server-side product indexing and search - Functional Specs

From Open Food Facts wiki

Summary

This page contains the functional specifications for the Server-side product indexing and search task of the Project:Personalized_Search project.

It lists the options under consideration and the ones we have chosen.

Search API

The search API is called by the app to retrieve a large number of generic search results (products) that match a query.

There are some choices we need to make:

  1. How do we declare what kind of products we want to get back? (matching)
    1. Recommendation: by category + by keyword
  2. Which of the matching products do we send back? (ordering)
    1. Recommendation: a query-independent sort key based on product data completeness, data quality, product quality and product popularity

Match (required)

The search query needs to contain at least one required criterion.

Here are potential criteria that we could consider:

Match on category

The query specifies a category tag (e.g. en:cookies).

Pros:

  • Can be used to display product recommendations for a given product (using the most specific category of the product)
  • Can be used for a search box that is limited to categories (e.g. suggest as you type restricted to known categories)
  • Can be used for a per-category search interface that displays 1st level choices, then 2nd level choices, etc. (e.g. first click on "Vegetables", then "Bananas" etc.)
  • There could be ways to pre-compute per category results that could be used to speed up queries

Cons:

  • Does not support freely typed search queries (e.g. with a product name or brand)


Match on keywords

Allow users to enter keywords that are matched against any of the product name, brand, categories, etc., without distinction. This is how the current search function on the OFF website is implemented.

Pros:

  • Can be used for a search box

Cons:

  • Currently does not work well for products in multiple languages
  • Difficult to do pre-computations to optimize queries

Match on product

Given a product as input, return "similar" products.

Pros:

  • Can be used for product recommendations
  • Similarity could be based on things other than the category (e.g. labels, ingredients, brands etc.)

Cons:

  • It could be very expensive to retrieve similar products in real-time or to pre-compute all of them
  • Does not work for category search or freely typed search queries

Match a given set of products

Given the barcodes of multiple products, return the corresponding products.

This is to be used to make in-store personalized recommendations: the user scans all the product choices they are considering (e.g. 5 different breakfast cereals), and we return the corresponding products, which can then be filtered/ranked according to the user preferences.
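As an illustration of the "at least one required criterion" rule, here is a minimal sketch of a query object covering the four match options described above. The class name, field names and error message are hypothetical; this page does not define a request format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SearchQuery:
    """Hypothetical request shape; none of these names come from the spec."""

    category: Optional[str] = None      # category tag, e.g. "en:cookies"
    keywords: Optional[str] = None      # freely typed text
    similar_to: Optional[str] = None    # barcode of a reference product
    barcodes: List[str] = field(default_factory=list)  # explicit product set

    def match_criteria(self) -> List[str]:
        """Return the names of the match criteria present in the query."""
        present = []
        if self.category:
            present.append("category")
        if self.keywords and self.keywords.strip():
            present.append("keywords")
        if self.similar_to:
            present.append("similar_to")
        if self.barcodes:
            present.append("barcodes")
        return present

    def validate(self) -> None:
        """Reject queries that carry no match criterion at all."""
        if not self.match_criteria():
            raise ValueError("query needs at least one match criterion")
```

A query such as `SearchQuery(category="en:cookies", keywords="chocolate")` would combine the two recommended criteria (category and keyword).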

Optional filters

The search query can contain filters to restrict the result set, e.g. the country where the products are sold.

Those filters may also include user preferences, for apps that do not want to do the personalization locally.

Availability

We can filter by country of availability, by store (or stores) of availability, and by number of scans (a proxy for wide availability).
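The availability filters could translate into a MongoDB filter document along these lines. This is a sketch only: the function name and its parameters are hypothetical, and the field names `countries_tags`, `stores_tags` and `unique_scans_n` are assumptions about how the product documents are stored. `$in` and `$gte` are standard MongoDB query operators.

```python
from typing import Any, Dict, Iterable, Optional


def build_availability_filter(
    country: Optional[str] = None,
    stores: Optional[Iterable[str]] = None,
    min_scans: Optional[int] = None,
) -> Dict[str, Any]:
    """Build a MongoDB filter document from optional availability filters."""
    mongo_filter: Dict[str, Any] = {}
    if country:
        # Matching a scalar against an array field matches any element.
        mongo_filter["countries_tags"] = country
    if stores:
        # Keep products sold in at least one of the requested stores.
        mongo_filter["stores_tags"] = {"$in": list(stores)}
    if min_scans is not None:
        # Number of scans used as a proxy for wide availability.
        mongo_filter["unique_scans_n"] = {"$gte": min_scans}
    return mongo_filter
```

Filters that are not supplied add nothing to the document, so a query without filters is unrestricted.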

Non-personalized sort order

It is unrealistic to return all possible results to the calling app for all queries. E.g. if the query is "cookies", we cannot return tens of thousands of results. So we will need a reasonable sort order that returns the most useful results for the app, without limiting the options for personalization of the results.

Possible criteria to include in the sort order:

Completeness of data

Products for which we have enough data (e.g. ingredients, nutrition facts)

Data quality

Products for which there are no data inconsistencies (e.g. warnings on nutrition facts)

Product quality

If we expect that the app will only offer "better" products and not worse products, we may include things like the Nutri-Score or NOVA in the sort order.

Popularity

We could also include a measure of product popularity (e.g. derived from scans) in the sort order.
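One way to combine the four criteria above into a single query-independent sort key is a weighted sum of normalized sub-scores. The sketch below is illustrative only: the weights, the input field names (`completeness`, `data_quality_warnings`, `nutriscore_grade`, `scans`) and the log scaling of scans are all assumptions, not decisions recorded on this page.

```python
import math

# Hypothetical mapping of Nutri-Score grades to a 0..1 score.
NUTRISCORE_POINTS = {"a": 1.0, "b": 0.75, "c": 0.5, "d": 0.25, "e": 0.0}

# Hypothetical weights; they sum to 1 so the key stays in the 0..1 range.
WEIGHTS = {"completeness": 0.3, "data_quality": 0.2,
           "product_quality": 0.3, "popularity": 0.2}


def sort_key(product: dict) -> float:
    """Return a 0..1 key; higher values should be returned first."""
    # Completeness: assumed to be a 0..1 ratio already stored on the product.
    completeness = min(max(float(product.get("completeness", 0.0)), 0.0), 1.0)
    # Data quality: 1 when no warnings are recorded, 0 otherwise.
    data_quality = 0.0 if product.get("data_quality_warnings") else 1.0
    # Product quality: unknown grades score 0.
    product_quality = NUTRISCORE_POINTS.get(product.get("nutriscore_grade", ""), 0.0)
    # Popularity: log-scaled scans, capped at 1 around one million scans.
    scans = max(int(product.get("scans", 0)), 0)
    popularity = min(math.log10(1 + scans) / 6.0, 1.0)
    total = (WEIGHTS["completeness"] * completeness
             + WEIGHTS["data_quality"] * data_quality
             + WEIGHTS["product_quality"] * product_quality
             + WEIGHTS["popularity"] * popularity)
    return round(total, 4)
```

Because the key does not depend on the query, it can be computed once per product and stored, then reused for every search.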

Number of results returned

We need to return at least 100 results so that the app has enough results to personalize the result set.

We could support more than 100 results, and possibly offer to request further results (e.g. an app could request 100 results, and then decide to request 100 more).
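Requesting further results could work as simple page-based slicing over the pre-sorted result list. In this sketch the function name, the `page` and `page_size` parameters, and the choice to enforce 100 as a floor on the page size are assumptions.

```python
from typing import List, Sequence

MIN_PAGE_SIZE = 100  # the spec asks for at least 100 results


def paginate(sorted_codes: Sequence[str], page: int = 1,
             page_size: int = MIN_PAGE_SIZE) -> List[str]:
    """Return one page of results from a list already in sort-key order."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    # Never return fewer than the minimum needed for personalization.
    page_size = max(page_size, MIN_PAGE_SIZE)
    start = (page - 1) * page_size
    return list(sorted_codes[start:start + page_size])
```

An app would call `paginate(codes, page=1)` first and `paginate(codes, page=2)` only if it needs 100 more results.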

Returned fields

We could either return a fixed set of fields that includes all fields needed for personalization, or possibly let the app declare which fields it wants returned.
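Both options can share one code path: a default field set that is used unless the app declares its own. The sketch below assumes hypothetical names (`DEFAULT_FIELDS`, `project`) and an illustrative default list; the actual set of fields needed for personalization is not defined on this page.

```python
from typing import Any, Dict, Iterable, Optional

# Illustrative default; the real list would cover all personalization inputs.
DEFAULT_FIELDS = ("code", "product_name", "brands",
                  "nutriscore_grade", "nova_group", "ingredients_tags")


def project(product: Dict[str, Any],
            fields: Optional[Iterable[str]] = None) -> Dict[str, Any]:
    """Keep only the requested fields, or the default set if none are given."""
    wanted = tuple(fields) if fields else DEFAULT_FIELDS
    # Fields missing from the product are skipped rather than set to None.
    return {name: product[name] for name in wanted if name in product}
```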

Server-side indexing

Based on the type of queries we want to support, we may be able and may need to pre-compute and store new information in the database to better support the queries.

Similarity

If we decide to support product match, we need to pre-compute and store lists of similar products.
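As a rough illustration of such a pre-computation, similar products could be ranked by the Jaccard similarity of their tag sets (categories, labels, ingredients, brands). The similarity measure, the function names and the `top_k` cut-off are assumptions chosen for the example; the page does not pick a similarity definition. The pairwise loop is quadratic in the number of products, which illustrates the cost concern raised under "Match on product".

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_products(tags_by_code: Dict[str, Set[str]],
                     top_k: int = 10) -> Dict[str, List[str]]:
    """For each barcode, list up to top_k other barcodes by similarity."""
    result: Dict[str, List[str]] = {}
    for code, tags in tags_by_code.items():
        scored = [(jaccard(tags, other_tags), other)
                  for other, other_tags in tags_by_code.items()
                  if other != code]
        # Drop products with nothing in common.
        scored = [(score, other) for score, other in scored if score > 0]
        # Highest similarity first; ties broken by barcode for stable output.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        result[code] = [other for _, other in scored[:top_k]]
    return result
```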

Sort order key computation

We should pre-compute the sort order key for all products.

Performance

  • 80% of search queries need to return results in less than 5 seconds.
  • 95% of search queries need to return results in less than 10 seconds.

Search queries should not overload the OFF servers (API server + MongoDB server).
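The two latency targets in this section can be checked against measured response times with a percentile computation. The sketch below uses the nearest-rank percentile method; the function names and the idea of checking a list of recorded latencies (in seconds) are assumptions about how monitoring might be done.

```python
from typing import Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with pct% of values at or below it."""
    if not values:
        raise ValueError("no measurements")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in 1..100")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_targets(latencies_s: Sequence[float]) -> bool:
    """True when 80% of queries take < 5 s and 95% take < 10 s."""
    return percentile(latencies_s, 80) < 5.0 and percentile(latencies_s, 95) < 10.0
```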