Search API V3: Difference between revisions

From Open Food Facts wiki
mNo edit summary
(Changes to API)
Line 92: Line 92:
     field: str
     field: str
     value: str
     value: str
     # One of eq, ne, like, without
     # One of eq, ne, like
     operator: str = 'eq'
     operator: str = 'eq'


Line 99: Line 99:
     field: str
     field: str
     value: float
     value: float
     # One of eq, ne, lt, gt, without
     # One of eq, ne, lt, gt
     operator: str = 'eq'
     operator: str = 'eq'


Line 106: Line 106:
     field: str
     field: str
     value: datetime.datetime
     value: datetime.datetime
     # One of eq, ne, lt, gt, without
     # One of lt, gt
     operator: str = 'eq'
     operator: str = 'eq'


Line 136: Line 136:
The remaining APIs have several commonalities:
The remaining APIs have several commonalities:


* Only fields with a value are returned in the JSON response, to keep response size down without needing to specify fields manually
* An optional ''response_fields'' parameter is provided, to limit the fields in the response further
* An optional ''response_fields'' parameter is provided, to limit the fields in the response further
* POST is used, to support a complex request body
* POST is used, to support a complex request body
Line 144: Line 143:
* ''like'' operator, which does not need to match to exact fields, but rather will match by use of the [https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-snowball-tokenfilter.html snowball token filter] in Elasticsearch if the field supports it.
* ''like'' operator, which does not need to match to exact fields, but rather will match by use of the [https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-snowball-tokenfilter.html snowball token filter] in Elasticsearch if the field supports it.
* ''gt'', ''lt'' (greater than, less than) operators are provided.
* ''gt'', ''lt'' (greater than, less than) operators are provided.
* To maintain backwards compatibility, a ''without'' operator is provided, in which case the value will be ignored. Note that the value is still required, so a dummy value should be used (it was feared that making this optional would lead to client side bugs). This should be well documented.
* The without operator will match empty data, as well as the empty string and 0 for numeric fields. This supports queries such as "Get foods without carbohydrates" for backwards compatibility, even though that would be better served by making a query where carbohydrates = 0 (ensuring that missing data is not returned in the results).
* There is no support for an OR operator (which is supported in the [[Open Food Facts Search API Version 2|V2 API]]), however clients can perform this logic themselves if they wish through multiple API calls.
* There is no support for an OR operator (which is supported in the [[Open Food Facts Search API Version 2|V2 API]]), however clients can perform this logic themselves if they wish through multiple API calls.



Revision as of 16:19, 5 August 2022

Overview

This document is a technical proposal for a new search API, V3 (for V2, see Open Food Facts Search API Version 2).

Goals

Create new search APIs to facilitate:

  • Autocomplete (currently unsupported)
  • Facet search (replacing our current search API)
  • Migrating internal services (such as the search bar) to this API. However, performing this migration is not in scope.

Non Goals

  • Improving the current API
  • Modifying the current search UX

Elasticsearch

Elasticsearch has been chosen for several reasons:

  • Advanced search functionality (easy to support features such as autocomplete)
  • Mature ecosystem
  • Easy to host, completely free version suits our needs

Configuration

We will use Elasticsearch 8.3.3 (the latest version at the time of writing), deployed via Docker. We will use a replication factor of one (i.e., two copies of the data in total), with shards split across two nodes.

From testing on an M1 MacBook Pro, we see:

  • Memory: ~1GB per node
  • CPU: Low single digits
  • Disk usage: ~6GB after indexing, reduces to ~4GB over time
  • Latency: Autocomplete queries take ~0.1s on average
  • Import time (full index from CSV): ~15 mins

Given the relatively low resource usage, we could likely use more nodes. For now, it is suggested to monitor resource usage, and add nodes (and increase the replication factor) if needed.

X-Pack security will be turned off, as we do not need to encrypt communication to Elasticsearch or between the nodes. This simplifies the setup and the installation of elasticvue (below). If this index were used for more sensitive data in the future, this decision should be revisited.

Monitoring

We will use elasticvue to view information such as resource usage and sharding, and to perform debugging. Query stats (to monitor ongoing use) can be seen at elasticvue --> indices --> settings/cog --> Show stats.

Data

The core datatype will be the Product. This data will be stored sparsely (i.e., fields that are not set will not be stored in the index) to save space.

To enable API cases such as partial text search, a rich autocomplete, and the possibility of eventually serving as a unified read layer, all fields will be added to the index. Only product names, brands and categories will be indexed for autocomplete queries.

An argument could be made for storing fewer fields, and reducing disk usage. However, as illustrated above, disk usage is quite reasonable.

Search Service

The Search Service will be written in Python, using FastAPI. It will leverage Elasticsearch DSL to aid in data modeling and query writing.

Importing Data

A Redis container will be created, which will serve as a queue/buffer for writing data. This approach is preferable to a webhook because search service instability will not affect the main server, and we can better handle write spikes and DoS attacks.

Data will be added to the queue when the store_product method is called on the main service. This data will contain the full product definition and a field will indicate if this is an upsert or a delete.

The Search Service will consume from this queue via the Redis Python API, indexing (or deleting) each product as it receives messages.
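The queue flow described above can be sketched roughly as follows. This is a minimal illustration, not the proposed implementation: the queue name, the message layout (`action` plus `product`), the use of a Redis list with RPUSH/BLPOP, and the `code` barcode field are all assumptions, and the producer is written in Python only for illustration (the work plan places the real writer on the Perl side). A plain dict stands in for the Elasticsearch index, and `client` is expected to be a redis-py client or any object with the same `rpush` and `blpop` methods.

```python
import json

QUEUE_KEY = "search_import_queue"  # hypothetical queue name


def enqueue_product(client, product, delete=False):
    """Producer side: push one change onto the Redis list.

    The message carries the full product definition plus a field
    saying whether this is an upsert or a delete.
    """
    message = {"action": "delete" if delete else "upsert", "product": product}
    client.rpush(QUEUE_KEY, json.dumps(message))


def consume_one(client, index, timeout=5):
    """Consumer side: pop one message and apply it to the index.

    `index` is a dict standing in for Elasticsearch. Returns False
    when the queue stayed empty for `timeout` seconds.
    """
    item = client.blpop(QUEUE_KEY, timeout=timeout)
    if item is None:
        return False
    _key, raw = item
    message = json.loads(raw)
    product = message["product"]
    barcode = product["code"]  # assumed name of the barcode field
    if message["action"] == "delete":
        index.pop(barcode, None)
    else:
        index[barcode] = product
    return True
```

In a running service, `consume_one` would be called in a loop; because the consumer blocks on the queue rather than being called by the main server, a slow or failing search service only delays indexing.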

A manual import script will also be written, to take the CSV file and bulk import items. To ensure that data from the manual import script is up to date (i.e., there is no gap between the time the script is run and the time the data is imported), we need to:

  • Modify the Search Service to perform a set like:
    • SET product:<timestamp>:<barcode> <full_product_definition> EX 129600
    • Explanation: Set a key for each product that expires in 36 hours
  • Modify the import script to:
    • Determine the last updated/created timestamp and store it as import_cutoff
    • Fetch all recent writes from Redis with a SCAN, matching anything of the form product:*
    • Iterate, in timestamp order, through the writes that are after import_cutoff
    • Apply those writes
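The catch-up steps above might look something like the following sketch. It assumes that timestamps in the key are integer epoch seconds and that the product definition is stored as JSON; neither is specified in the proposal. `client` is expected to be a redis-py client (or anything exposing the same `set`, `scan_iter` and `get` methods), and the function names are hypothetical.

```python
import json

TTL_SECONDS = 36 * 60 * 60  # 129600 seconds, i.e. 36 hours


def record_write(client, timestamp, barcode, product):
    """Search Service side: keep a short-lived copy of every write."""
    key = f"product:{timestamp}:{barcode}"
    client.set(key, json.dumps(product), ex=TTL_SECONDS)


def writes_after(client, import_cutoff):
    """Import script side: collect writes newer than import_cutoff.

    Returns (timestamp, barcode, product) tuples, oldest first, so the
    caller can apply them in order after the bulk import finishes.
    """
    found = []
    for key in client.scan_iter(match="product:*"):
        if isinstance(key, bytes):
            key = key.decode()
        _prefix, timestamp, barcode = key.split(":", 2)
        if int(timestamp) > import_cutoff:
            raw = client.get(key)
            if raw is not None:  # the key may have expired since the scan
                found.append((int(timestamp), barcode, json.loads(raw)))
    found.sort(key=lambda item: item[0])
    return found
```

The 36-hour expiry bounds how much Redis memory the buffered writes can use, while leaving a window much longer than the roughly 15-minute import measured earlier.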

API Overview

These APIs will:

The proposed API definition is below. Note that the requests are represented as Python objects as used in FastAPI - in reality, this is a JSON payload:

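import datetime
from typing import List, Optional, Set

from pydantic import BaseModel


class SearchBase(BaseModel):
    # Optional; limits the fields returned in the response
    response_fields: Optional[Set[str]] = None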


class AutocompleteRequest(SearchBase):
    text: str
    search_fields: List[str] = constants.AUTOCOMPLETE_FIELDS


class StringFilter(BaseModel):
    field: str
    value: str
    # One of eq, ne, like
    operator: str = 'eq'


class NumericFilter(BaseModel):
    field: str
    value: float
    # One of eq, ne, lt, gt
    operator: str = 'eq'


class DateTimeFilter(BaseModel):
    field: str
    value: datetime.datetime
    # One of lt, gt (required: 'eq' is not a supported date operator, so there is no default)
    operator: str


class SearchRequest(SearchBase):
    # Works as an intersection/AND query
    string_filters: List[StringFilter]
    numeric_filters: List[NumericFilter]
    date_time_filters: List[DateTimeFilter]

These are then used as follows:

# Gets the product matching a barcode. Included to demonstrate how this service could replace the main read API; it should not be exposed unless it is decided to send all reads through this service
GET /barcode/<barcode>

# Autocomplete request
POST /autocomplete, body=AutocompleteRequest

# Fully fledged search
POST /search, body=SearchRequest
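Since the request is a JSON payload in practice, a body for POST /search could be built and checked as in the sketch below. The field names (`brands`, `sugars_100g`, `created_datetime`) and the `validate_search_body` helper are illustrative assumptions; only the filter groups and the operator sets are taken from the definitions above.

```python
import json

# Operators allowed per filter group, as listed in the class comments
ALLOWED_OPERATORS = {
    "string_filters": {"eq", "ne", "like"},
    "numeric_filters": {"eq", "ne", "lt", "gt"},
    "date_time_filters": {"lt", "gt"},
}


def validate_search_body(body):
    """Return a list of error strings; an empty list means every
    filter uses an operator that its filter group supports."""
    errors = []
    for group, allowed in ALLOWED_OPERATORS.items():
        for position, item in enumerate(body.get(group, [])):
            operator = item.get("operator", "eq")
            if operator not in allowed:
                errors.append(
                    f"{group}[{position}]: unsupported operator '{operator}'"
                )
    return errors


# Example body: products whose brand matches "nutella", with less than
# 10 g of sugar per 100 g, created after the start of 2022.
search_body = {
    "response_fields": ["code", "product_name", "brands"],
    "string_filters": [
        {"field": "brands", "value": "nutella", "operator": "like"}
    ],
    "numeric_filters": [
        {"field": "sugars_100g", "value": 10, "operator": "lt"}
    ],
    "date_time_filters": [
        {"field": "created_datetime", "value": "2022-01-01T00:00:00", "operator": "gt"}
    ],
}

payload = json.dumps(search_body)  # this string is what would be POSTed
```

All three filter lists are combined as an intersection, so a product must satisfy every filter to be returned.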

API Discussion

The barcode GET API is included to demonstrate how this service could easily replace our existing read APIs, but is not intended to be used for the moment.

The remaining APIs have several commonalities:

  • An optional response_fields parameter is provided, to limit the fields in the response further
  • POST is used, to support a complex request body
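As a rough illustration of the response_fields behaviour, the sketch below trims a product document to the requested fields. The helper name and the sample product are hypothetical, and a real implementation might instead ask Elasticsearch to return only those fields rather than filtering in Python.

```python
def project_fields(product, response_fields=None):
    """Return a copy of `product` limited to `response_fields`.

    When `response_fields` is None (the parameter was omitted),
    the product is returned unchanged.
    """
    if response_fields is None:
        return dict(product)
    return {
        name: value
        for name, value in product.items()
        if name in response_fields
    }


product = {"code": "123", "product_name": "Oat drink", "brands": "Acme"}
trimmed = project_fields(product, {"code", "brands"})
```

Requested fields that the product does not have are simply absent from the result.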

The /search API is the most complex as it allows a series of filters to support the use cases in the current API. These filters will work like an intersection/AND query. Of interest are the:

  • like operator, which does not require an exact match on the field value, but instead matches using the snowball token filter in Elasticsearch if the field supports it.
  • gt and lt (greater than, less than) operators.
  • There is no support for an OR operator (which is supported in the V2 API), however clients can perform this logic themselves if they wish through multiple API calls.
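One way the filters could be translated into an Elasticsearch query is sketched below, using plain dictionaries in Elasticsearch query DSL form rather than the Elasticsearch DSL library the service is expected to use. The mapping chosen here (eq and ne to `term`, like to `match`, lt and gt to `range`) is an assumption about a reasonable implementation, not something the proposal specifies, and the function names are hypothetical.

```python
def filter_clause(field, value, operator):
    """Map one filter to a single Elasticsearch query clause."""
    if operator in ("eq", "ne"):
        # Exact match; 'ne' uses the same clause, negated by the caller
        return {"term": {field: value}}
    if operator == "like":
        # Analyzed match, so stemming applies if the field is configured for it
        return {"match": {field: value}}
    if operator in ("lt", "gt"):
        return {"range": {field: {operator: value}}}
    raise ValueError(f"unsupported operator: {operator}")


def build_bool_query(filters):
    """Combine filters as an intersection (AND) in one bool query.

    `filters` is an iterable of dicts with field, value and operator keys.
    """
    must, must_not = [], []
    for item in filters:
        operator = item.get("operator", "eq")
        clause = filter_clause(item["field"], item["value"], operator)
        if operator == "ne":
            must_not.append(clause)
        else:
            must.append(clause)
    return {"bool": {"must": must, "must_not": must_not}}
```

Because every clause sits in `must` or `must_not`, a document has to satisfy all filters at once, which is why an OR across filters would need either a different query shape or several separate requests merged by the client.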

Relationship With Existing APIs

The autocomplete API does not overlap with the other APIs. However, the new proposed /search API has a lot of conceptual overlap with the current search APIs. It is proposed to encourage users to migrate to this new V3 API (and to use it when writing new code) through guidance in the API docs.

Furthermore, once the new API has proven stable, the legacy search.pl could be switched over (with a translation layer to map between the old and new API request and response formats). The V2 API can also be migrated, but this will require a way to support the OR operator (potentially dropping support, using another translation layer with multiple requests, or modifying the API).

This work is considered out of scope for this proposal.

Work Plan

  • Completed (locally):
    • Product document definition
    • Bulk import script
    • API definition
    • API implementation (partly implemented)
    • Docker, Elasticvue configuration
  • TODO:
    • Proposal alignment
    • OFF-search repo creation, initial commit
    • Finish API implementation, unit tests
    • Redis reader
    • Deploy (without any traffic) - will need assistance for this, as well as Nginx configuration
    • Redis writer on the Perl side
    • Final testing
    • Document API
    • Ongoing monitoring

Future Work

There appear to be many opportunities to use this infrastructure further:

  • Incorporate the autocomplete API into the main search bar
  • Use the new search API to handle searches on the primary website (should result in improved ranking, matching, lower DB load)
  • Robotoff could use the same cluster
  • Current search APIs could be switched over (with a translation layer)