== Overview ==
This document is a technical proposal for a new search API.
=== Goals ===
Create a new search API that supports:
* Autocomplete (currently unsupported)
* Facet search (replacing our current search API)
* Migration of internal services (such as the search bar) to this API. Performing the migration itself is out of scope.
=== Non-Goals ===
* Improving the current API
* Modifying the current search UX
== Elasticsearch ==
Elasticsearch was chosen for several reasons:
* Advanced search functionality (features such as autocomplete are easy to support)
* Mature ecosystem
* Easy to host, and the completely free version suits our needs
=== Configuration ===
We will use Elasticsearch 8.3.3 (the latest version at the time of writing), deployed via Docker. We will use a replication factor of one, with shards split across two nodes.
Testing on an M1 MacBook Pro shows:
* Memory: ~1 GB per node
* CPU: low single-digit percentage
* Disk usage: ~6 GB after indexing, dropping to ~4 GB over time
* Latency: autocomplete queries take ~0.1 s on average
* Import time (full index from CSV): ~15 minutes
Given the relatively low resource usage, we could likely run more nodes. For now, we suggest monitoring resource usage and adding nodes (and increasing the replication factor) if needed.
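As a rough illustration of the configuration described above, the index could be created with a settings body along the following lines. This is only a sketch: the helper names and the choice of two primary shards (one per node) are assumptions made for illustration, and only the replication factor of one comes from this proposal.

```python
import json


def build_index_settings(primary_shards: int = 2, replicas: int = 1) -> dict:
    """Build the settings body for an Elasticsearch create-index request.

    primary_shards=2 is an assumption (one primary shard per node);
    replicas=1 matches the replication factor proposed above.
    """
    if primary_shards < 1:
        raise ValueError("an index needs at least one primary shard")
    if replicas < 0:
        raise ValueError("the replica count cannot be negative")
    return {
        "settings": {
            "index": {
                "number_of_shards": primary_shards,
                "number_of_replicas": replicas,
            }
        }
    }


def total_shard_copies(body: dict) -> int:
    """Count every shard copy (primaries plus replicas) the cluster must host."""
    index = body["settings"]["index"]
    return index["number_of_shards"] * (1 + index["number_of_replicas"])


if __name__ == "__main__":
    print(json.dumps(build_index_settings(), indent=2))
```

Note that ''number_of_shards'' is fixed when an index is created, whereas ''number_of_replicas'' can be changed on a live index, so increasing the replication factor later does not require a reindex.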
=== Monitoring ===
We will use [https://elasticvue.com/ elasticvue] to view information such as resource usage and sharding, and to perform debugging. Query statistics can be seen at elasticvue --> indices --> settings/cog --> Show stats.
=== Data ===
The core datatype will be the ''Product''.
To enable use cases such as partial text search and rich autocomplete, and to leave open the possibility of eventually serving as a unified read layer, all fields will be added to the index. Only product names, brands and categories will be indexed for autocomplete queries.
An argument could be made for storing fewer fields to reduce disk usage. However, as the measurements above show, disk usage is quite reasonable.
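One way to express this split (every field stored, only three fields prepared for autocomplete) is sketched below as a plain mapping body. The field names ''product_name'', ''brands'' and ''categories'', the ''autocomplete'' sub-field name, and the use of the ''completion'' field type are all assumptions; this proposal does not state which Elasticsearch mechanism will back autocomplete.

```python
from typing import Iterable

# Assumed field names for the three autocomplete sources named above.
AUTOCOMPLETE_FIELDS = ("product_name", "brands", "categories")


def build_product_mapping(
    autocomplete_fields: Iterable[str] = AUTOCOMPLETE_FIELDS,
) -> dict:
    """Build a mapping body for a hypothetical Product index.

    "dynamic": True lets Elasticsearch add every other product field
    automatically, so the whole document is indexed. Only the listed
    fields get an extra "autocomplete" sub-field (assumed here to be
    of type "completion").
    """
    properties = {
        name: {
            "type": "text",
            "fields": {"autocomplete": {"type": "completion"}},
        }
        for name in autocomplete_fields
    }
    return {"mappings": {"dynamic": True, "properties": properties}}
```

Keeping the autocomplete data in a sub-field means the main field can still be used for ordinary full-text queries.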
== Search Service ==
The Search Service will be written in Python, using [https://fastapi.tiangolo.com/ FastAPI]. It will leverage [https://elasticsearch-dsl.readthedocs.io/en/latest/ Elasticsearch DSL] to aid in data modeling and query writing.
=== Importing Data ===
A Redis container will be created to serve as a queue/buffer for writes. When the [https://github.com/openfoodfacts/openfoodfacts-server/blob/af59dc1155a096328e9dc4710985a12a8be878c3/lib/ProductOpener/Products.pm#L968 ''store_product'' method] is called on the main service, a new entry containing the full product definition will be added to the queue. A field will indicate whether the entry is an upsert or a delete.
The Search Service will consume from this queue, indexing (or deleting) each product as it receives messages.
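The consumer logic described above can be sketched as follows. The message layout (an ''action'' field plus a ''product'' object keyed by ''code'') is a hypothetical format chosen for illustration, and a plain dictionary stands in for the Elasticsearch index so that the example runs without any services.

```python
import json
from typing import Dict


def apply_message(index: Dict[str, dict], raw_message: str) -> None:
    """Apply one queue message to the index.

    The message format is an assumption:
    {"action": "upsert" | "delete", "product": {"code": "<barcode>", ...}}
    """
    message = json.loads(raw_message)
    action = message["action"]
    product = message["product"]
    barcode = product["code"]
    if action == "upsert":
        # Insert the product, or overwrite an earlier version of it.
        index[barcode] = product
    elif action == "delete":
        # Deleting a product that is not indexed is treated as a no-op.
        index.pop(barcode, None)
    else:
        raise ValueError(f"unknown action: {action!r}")


def drain(index: Dict[str, dict], messages: list) -> None:
    """Apply messages in the order in which they were queued."""
    for raw_message in messages:
        apply_message(index, raw_message)
```

Messages are applied strictly in queue order, so a delete that follows an upsert for the same barcode leaves the product absent from the index.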
A manual import script will also be written to take the CSV file and [https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html bulk import] its items. To ensure that data from the manual import script is up to date (i.e., there is no gap between the time the script is run and the time the data is imported), we need to:
* Modify the Search Service to perform a set like:
** ''SET product:<timestamp>:<barcode> <full_product_definition> EX 129600''
** Explanation: set a key for each product that expires in 36 hours (129,600 seconds)
* Modify the import script to:
** Determine the last updated/created timestamp and store it as import_cutoff
** Fetch all recent writes from Redis with a ''SCAN'', matching any key of the form product:*
** Iterate, in timestamp order, through the writes made after import_cutoff
** Apply those writes
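The catch-up step of the import script can be sketched as below. It assumes that the timestamp in each key is an integer Unix time in seconds and that barcodes contain no colon; the function names are hypothetical. The key list stands in for the result of the ''SCAN'', which returns keys in no guaranteed order and may return the same key more than once, hence the de-duplication and the explicit sort.

```python
from typing import Iterable, List, Tuple

# 36 hours in seconds, matching the EX value in the SET command above.
KEY_TTL_SECONDS = 36 * 60 * 60


def parse_key(key: str) -> Tuple[int, str]:
    """Split 'product:<timestamp>:<barcode>' into (timestamp, barcode)."""
    prefix, timestamp, barcode = key.split(":", 2)
    if prefix != "product":
        raise ValueError(f"unexpected key: {key!r}")
    return int(timestamp), barcode


def writes_to_replay(keys: Iterable[str], import_cutoff: int) -> List[str]:
    """Return the keys written after import_cutoff, oldest first."""
    unique_keys = set(keys)  # SCAN may yield duplicates
    recent = [
        (parse_key(key), key)
        for key in unique_keys
        if parse_key(key)[0] > import_cutoff
    ]
    # Sorting on (timestamp, barcode) replays the writes in time order.
    return [key for _, key in sorted(recent)]
```

For each returned key, the import script would then read the stored product definition and apply it in the same way the queue consumer does.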
=== API ===
Each API endpoint will return a maximum of 100 items. The endpoints should also be unit tested. Once the API changes are complete, the endpoints should be exposed in the docs.
The proposed API definition is below. Note that the requests are represented as Python objects as used in FastAPI; in reality, each request is a JSON payload:
== Work Plan ==
* Completed (locally):
** Product document definition
** Bulk import script
** API definition
** API implementation (partly implemented)
** Docker, Elasticvue configuration
* TODO:
** Proposal alignment
** Commit to OFF repo
** Finish API implementation
** Redis reader
** Deploy (without any traffic)
** Redis writer on the Perl side
** Final testing
** Document API
** Ongoing monitoring