== Overview ==
This document serves as a technical proposal for a new search API - V3 ([[Open Food Facts Search API Version 2|see V2 here]]).

'''NOTE''': '''We're now extracting search to [[Search-a-licious]], which has an OpenAPI documentation, an Elasticsearch backend and everything you'd expect from a modern search API.'''

'''IMPORTANT:''' our goals have shifted a bit on this project. See the [https://docs.google.com/document/d/1mibE8nACcmen6paSrqT9JQk5VbuvlFUXI1S93yHCK2I/edit Search-a-licious roadmap architecture notes].

=== Goals ===
Create new search APIs to facilitate:
* Autocomplete (currently unsupported)
=== Configuration ===
We will use Elasticsearch 8.3.3 (latest), deployed via Docker. We will use a replication factor of one (ie, two copies of the data in total), with shards split across two nodes.

From testing on an M1 MacBook Pro, we see:
Given the relatively low resource usage, we could likely use more nodes. For now, it is suggested to monitor resource usage, and add nodes (and increase the replication factor) if needed.

X-Pack security will be turned off, as we do not need to encrypt any communication to Elasticsearch or between the nodes. This simplifies the setup and install of elasticvue (below). If this index is used for more sensitive data in the future, this decision should be revisited.
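To make "replication factor of one" concrete, the index settings would look roughly like the sketch below. The exact shard count is an assumption (one primary per node), not a finalized value:

```python
# Hypothetical Elasticsearch index settings matching the deployment above.
# These are illustrative, not the final configuration.
INDEX_SETTINGS = {
    "settings": {
        "number_of_shards": 2,    # assumption: one primary shard per node
        "number_of_replicas": 1,  # replication factor of one
    }
}

def total_copies(settings: dict) -> int:
    """Each shard exists once as a primary, plus once per replica."""
    return settings["settings"]["number_of_replicas"] + 1
```

With one replica, each document is stored twice in total, so the cluster survives the loss of either node.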
=== Monitoring ===
We will use [https://elasticvue.com/ elasticvue] to see information such as resource usage and sharding, and to perform debugging. Query stats (to monitor ongoing use) can be seen at elasticvue --> indices --> settings/cog --> Show stats.
=== Data ===
The core datatype will be the ''Product''. This data will be stored sparsely (ie, fields that are not set will not be stored in the index) to reduce space.

To enable API cases such as partial text search, a rich autocomplete, and the possibility of eventually serving as a unified read layer, [https://static.openfoodfacts.org/data/data-fields.txt all fields] will be added to the index. Only product names, brands and categories will be indexed for autocomplete queries.

An argument could be made for storing fewer fields, and reducing disk usage. However, as illustrated above, disk usage is quite reasonable.
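As a minimal sketch of the sparse-storage idea: before indexing, fields that are not set are dropped from the document so they take no space in the index. The field names here are illustrative only:

```python
def to_sparse_document(product: dict) -> dict:
    """Drop unset (None/empty) fields so they are not stored in the index."""
    return {k: v for k, v in product.items() if v not in (None, "", [], {})}

# Example: only populated fields survive (field names are hypothetical).
raw = {"code": "123", "product_name": "Granola", "brands": None, "categories": ""}
sparse = to_sparse_document(raw)
```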
=== Importing Data ===
A Redis container will be created, which will serve as a queue/buffer for writing data. Using this approach is preferable to a webhook as search service instability will not affect the main server, and we can better handle write spikes/DOS attacks.

Data will be added to the queue when the [https://github.com/openfoodfacts/openfoodfacts-server/blob/af59dc1155a096328e9dc4710985a12a8be878c3/lib/ProductOpener/Products.pm#L968 ''store_product'' method] is called on the main service. This data will contain the full product definition, and a field will indicate if this is an upsert or a delete.

The Search Service will consume from this queue via the [https://redis-py.readthedocs.io/en/stable/ Redis Python API], indexing (or deleting) each product as it receives messages.

A manual import script will also be written, to take the CSV file and [https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html bulk import] items. To ensure that data from the manual import script is up to date (ie, no gap between the time of running the script and when data is imported), we need to:
** Apply those writes
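To make the queue contract concrete, here is a hedged sketch of how the consumer might turn one Redis message into Elasticsearch bulk-API action lines. The message shape (a ''delete'' flag plus the full product) and the index name are assumptions, not a finalized schema:

```python
import json

def message_to_bulk_action(message: bytes) -> list:
    """Translate one queue message into Elasticsearch bulk-API action lines.

    Assumed message shape: {"code": ..., "delete": bool, "product": {...}}.
    """
    payload = json.loads(message)
    if payload.get("delete"):
        return [{"delete": {"_index": "products", "_id": payload["code"]}}]
    return [
        {"index": {"_index": "products", "_id": payload["code"]}},
        payload["product"],
    ]

# An upsert message carries the full product definition...
upsert = json.dumps({"code": "123", "delete": False,
                     "product": {"code": "123", "product_name": "Granola"}}).encode()
upsert_actions = message_to_bulk_action(upsert)

# ...while a delete only needs the barcode.
delete = json.dumps({"code": "456", "delete": True}).encode()
delete_actions = message_to_bulk_action(delete)
```

Keeping the full product in each message means the consumer never has to call back to the main server, so search-side instability stays isolated from it.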
=== API Overview ===
These APIs will:
* Return a maximum of 100 items
* Require robust unit tests
* [https://openfoodfacts.github.io/api-documentation/#3SEARCHRequests Require documentation]
* [https://github.com/openfoodfacts/openfoodfacts-server/tree/111e0afdbac3c20ea34652b0b413be58be6dfae5/conf/nginx Need to be exposed in the Nginx configuration]
* Be prefixed with ''/v3'' (redirecting at the Nginx layer, but not using the v3 prefix at the Search Service layer)

The proposed API definition is below. Note that the requests are represented as Python objects as used in FastAPI - in reality, this is a JSON payload:
field: str
value: str
# One of eq, ne, like
operator: str = 'eq'

field: str
value: float
# One of eq, ne, lt, gt
operator: str = 'eq'

field: str
value: datetime.datetime
# One of lt, gt
operator: str = 'eq'
</pre>
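As one possible implementation sketch (not part of the proposal itself), the autocomplete endpoint could translate its request into an Elasticsearch query body restricted to the three autocomplete-indexed fields. The query shape and field names below are assumptions:

```python
def build_autocomplete_query(text: str, size: int = 10) -> dict:
    """Build an Elasticsearch search body matching a prefix against the
    autocomplete-indexed fields (product names, brands, categories).
    Field names are hypothetical."""
    size = min(size, 100)  # the APIs return a maximum of 100 items
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": text,
                "type": "bool_prefix",
                "fields": ["product_name", "brands", "categories"],
            }
        },
    }

body = build_autocomplete_query("choc", size=500)
```

The `size` clamp enforces the 100-item maximum from the API Overview even if a client asks for more.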
=== API Discussion ===
The barcode GET API is included to demonstrate how this service could easily replace our existing read APIs, but is not intended to be used for the moment.

The remaining APIs have several commonalities:
* An optional ''response_fields'' parameter is provided, to limit the fields in the response further
* POST is used, to support a complex request body
The ''/search'' API is the most complex, as it allows a series of filters to support the use cases in the [https://openfoodfacts.github.io/api-documentation/#3SEARCHRequests current API]. These filters will work like an intersection/AND query. Of particular interest:
* The ''like'' operator does not need to match fields exactly, but rather matches via the [https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-snowball-tokenfilter.html snowball token filter] in Elasticsearch if the field supports it.
* ''gt'', ''lt'' (greater than, less than) operators are provided.
* There is no support for an OR operator (which is supported in the [[Open Food Facts Search API Version 2|V2 API]]); however, clients can perform this logic themselves through multiple API calls if they wish.
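The AND semantics above map naturally onto an Elasticsearch bool query: ''eq'' and ''like'' become `must` clauses, ''ne'' becomes `must_not`, and ''lt''/''gt'' become range clauses. This is a sketch under assumed conventions, not the service's actual implementation:

```python
def filters_to_bool_query(filters: list) -> dict:
    """Combine filters as an intersection (AND query)."""
    must, must_not = [], []
    for f in filters:
        field, value, op = f["field"], f["value"], f.get("operator", "eq")
        if op == "eq":
            must.append({"term": {field: value}})
        elif op == "ne":
            must_not.append({"term": {field: value}})
        elif op == "like":
            # Analyzed match: relies on the field's analyzer (e.g. the
            # snowball token filter) rather than exact equality.
            must.append({"match": {field: value}})
        elif op in ("lt", "gt"):
            must.append({"range": {field: {op: value}}})
        else:
            raise ValueError(f"unsupported operator: {op}")
    return {"query": {"bool": {"must": must, "must_not": must_not}}}

# Hypothetical example: brand is "acme" AND nutriscore is not "e".
query = filters_to_bool_query([
    {"field": "brands", "value": "acme"},
    {"field": "nutriscore", "value": "e", "operator": "ne"},
])
```

Because every filter lands in `must` or `must_not`, there is no way to express OR here, which is consistent with the limitation noted above.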
=== Relationship With Existing APIs ===
The autocomplete API does not overlap with the other APIs. However, the new proposed ''/search'' API has a lot of conceptual overlap with the current search APIs. It is proposed to encourage users to migrate to this new V3 API (and to use it when writing new code) through guidance in the API docs.

Furthermore, once the new API has proven its stability, the legacy ''search.pl'' can be switched over (with a translation layer to map between old and new API request and response norms). The v2 API can also be migrated, but will require a solution for supporting the OR operator (potentially dropping support, using another translation layer with multiple requests, or modifying the API).

This work is considered out of scope for this proposal.
== Work Plan ==
* Completed
** Product document definition
** Bulk import script
** API definition
** API implementation
** Docker, Elasticvue configuration
** API tests
** Redis integration
* TODO:
** Deploy (without any traffic)
** Redis writer on the Perl side
** Document API
** Ongoing monitoring
== Future Work ==
There appear to be many opportunities to utilize this infrastructure further:
* Incorporate the autocomplete API into the main search bar
* Use the new search API to handle searches on the primary website (this should result in improved ranking, better matching, and lower DB load)
* Robotoff could use the same cluster
* Current search APIs could be switched over (with a translation layer)

[[Category:Search]]