JSON schema for search-a-licious configuration file

Type: object

Search-a-licious server configuration.

The configuration is loaded from a YAML file,
that must satisfy this schema.

Validations will be performed while we load it.

Indices

Type: object

Configuration of indices.

A Search-a-licious instance has only one configuration file,
but is capable of serving multiple datasets.

It provides a section for each index you want to create (corresponding to a dataset).

The key is the ID of the index that can be referenced at query time.
One index corresponds to a specific set of documents and can be queried independently.

If you have multiple indices, one of them must be designated as the default one,
see default_index.
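
As a rough illustration of the rule above, the following sketch checks that default_index refers to a declared index. The configuration is shown as an already-parsed Python dict; the index IDs `off` and `obf` and the helper `check_default_index` are hypothetical, and this is not the validation code Search-a-licious itself runs.

```python
def check_default_index(config: dict) -> str:
    """Return the default index ID, or raise if it is not a declared index."""
    indices = config.get("indices", {})
    default = config.get("default_index")
    if default not in indices:
        raise ValueError(
            f"default_index {default!r} must be one of {sorted(indices)}"
        )
    return default


# Hypothetical parsed configuration with two indices (details omitted).
config = {
    "indices": {"off": {}, "obf": {}},
    "default_index": "off",
}
default_id = check_default_index(config)
```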

Each additional property must conform to the following schema

Type: object

This object gives configuration for one index.

One index usually corresponds to one dataset.

Type: object

This is the configuration for the main index containing the data.

It is used to create the index in Elasticsearch and to configure its mappings
(along with the *fields* config).

Name

Type: string

Name of the index alias to use.

Search-a-licious will create an index using this name and an import date,
but the alias will always point to the latest index.

The alias must not already exist in your Elasticsearch instance.
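
To make the alias mechanism more concrete, here is a small sketch of how a dated index name could be derived from the alias. The `<alias>-<date>` pattern and the helper `dated_index_name` are assumptions made for this example; the exact naming scheme used by Search-a-licious is not specified here.

```python
from datetime import datetime, timezone


def dated_index_name(alias: str, when: datetime) -> str:
    """Build a concrete index name from the alias and an import date.

    The "<alias>-<YYYY-MM-DD-HH-MM-SS>" pattern is an assumption made for
    this example, not necessarily the format Search-a-licious uses.
    """
    return f"{alias}-{when.strftime('%Y-%m-%d-%H-%M-%S')}"


first = dated_index_name("off", datetime(2024, 1, 1, tzinfo=timezone.utc))
second = dated_index_name("off", datetime(2024, 2, 1, tzinfo=timezone.utc))
# After the second import completes, the alias "off" would be moved from
# `first` to `second`, so queries on the alias always hit the latest index.
```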

Number Of Shards

Type: integer Default: 4

Number of shards to use for the index.

Shards are useful to distribute the load on your cluster.
(see index settings)

Number Of Replicas

Type: integer Default: 1

Number of replicas to use for the index.

More replicas mean more resiliency but also more disk space and memory.

(see index settings)

Id Field Name

Type: string

Name of the field to use for _id.
It is mandatory to provide one.

If your dataset does not have an identifier field,
you should use a document preprocessor to compute one (see preprocessor).

Last Modified Field Name

Type: string

Name of the field containing the date of last modification,
in your indexed objects.

This is used for incremental updates using Redis queues.

The field value must be an int/float representing the timestamp.
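
If your source data stores the modification date in another form, it can be converted to a numeric timestamp before import. The sketch below assumes a hypothetical record with an ISO 8601 `modified` string and writes the result to a hypothetical `last_modified_t` field.

```python
from datetime import datetime

# Hypothetical source record whose modification date is an ISO 8601 string.
record = {"code": "123", "modified": "2024-03-01T12:00:00+00:00"}

# Convert it to a numeric Unix timestamp (seconds since the epoch, as a
# float), which is the form the last-modified field is expected to contain.
record["last_modified_t"] = datetime.fromisoformat(record["modified"]).timestamp()
```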

Fields

Type: object

Configuration of all fields we need to store in the index.

Keys are field names,
values contain the field configuration.

This is a very important part of the configuration.

Most of the Elasticsearch mapping depends on it.
Elasticsearch will also use this configuration
to provide the intended behaviour.

(see also Explain Configuration)

If you change those settings you will have to re-index all the data.
(But you can do so in the background).

Each additional property must conform to the following schema

Type: object

Name

Type: string Default: ""

Name of the field (must be unique).

Type: enum (of string)

Type of the field

Supported field types in Search-a-Licious are:

* keyword: string values that won't be interpreted (tokenized).
  Good for things like tags, serial, property values, etc.
* date: Date fields
* double, float, half_float, scaled_float:
  different ways of storing floats with different capacity
* short, integer, long, unsigned_long:
  integers (with different capacity: 16 / 32 / 64 bits)
* bool: boolean (true / false) values
* text: a text which is tokenized to enable full text search
* text_lang: like text, but with different values in different languages.
  Tokenization will use analyzers specific to each language.
* taxonomy: a field akin to keyword but
  with support for matching using taxonomy synonyms and translations
  (and in fact also a text mapping possibility)
* disabled: a field that is neither stored nor searchable
  (see [Elasticsearch help])
* object: this field contains a dict with sub-fields.

Must be one of:

  • "keyword"
  • "date"
  • "half_float"
  • "scaled_float"
  • "float"
  • "double"
  • "integer"
  • "short"
  • "long"
  • "unsigned_long"
  • "bool"
  • "text"
  • "text_lang"
  • "taxonomy"
  • "disabled"
  • "object"
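
The following sketch shows what a few field declarations might look like once parsed, together with a simple check against the list of supported types. The field names and the helper `unsupported_types` are hypothetical; the real validation is performed by Search-a-licious when it loads the configuration.

```python
SUPPORTED_TYPES = {
    "keyword", "date", "half_float", "scaled_float", "float", "double",
    "integer", "short", "long", "unsigned_long", "bool", "text",
    "text_lang", "taxonomy", "disabled", "object",
}

# Hypothetical field declarations, keyed by field name.
fields = {
    "code": {"type": "keyword", "required": True},
    "product_name": {"type": "text_lang"},
    "categories_tags": {"type": "taxonomy", "taxonomy_name": "category"},
    "unique_scans_n": {"type": "integer"},
}


def unsupported_types(fields: dict) -> dict:
    """Map field name -> type for every field whose type is not supported."""
    return {
        name: cfg.get("type")
        for name, cfg in fields.items()
        if cfg.get("type") not in SUPPORTED_TYPES
    }


problems = unsupported_types(fields)  # empty when every type is supported
```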

Required

Type: boolean Default: false

If required=True, the field is required in the input data.

An entry that does not contain a value for this field will be rejected.

Input Field

Default: null

Name of the input field to use when importing data.

By default, Search-a-licious uses the same name as the field name.

This is useful to index the same field using different types or configurations.

Split

Type: boolean Default: false

Whether to split the input field using split_separator.

This is useful if you have text fields that contain a list of values
(for example a comma-separated list of values, like apple,banana,carrot).

You must set split_separator to the character that separates the values in the dataset.
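
The effect of splitting can be sketched as follows. The helper `split_values` is hypothetical, and stripping whitespace and dropping empty items are choices made for this example rather than documented behaviour.

```python
def split_values(raw: str, split_separator: str = ",") -> list[str]:
    """Split a delimited string into a list of values.

    Stripping whitespace and dropping empty items are choices made for
    this example only.
    """
    return [part.strip() for part in raw.split(split_separator) if part.strip()]


values = split_values("apple,banana,carrot")
```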

Bucket Agg

Type: boolean Default: false

Whether to add a bucket aggregation to the Elasticsearch query for this field.

It is used to return a 'faceted-view' with the number of results for each facet value,
or to generate bar charts.

Only valid for keyword, taxonomy or numeric field types.
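
For readers unfamiliar with bucket aggregations, the sketch below builds an Elasticsearch `terms` aggregation for each field flagged with bucket_agg. The helper `build_terms_aggs` and the field names are hypothetical, and the query Search-a-licious actually generates may differ (for example for numeric fields).

```python
def build_terms_aggs(fields: dict, size: int = 10) -> dict:
    """Build a terms aggregation for every field flagged with bucket_agg."""
    return {
        name: {"terms": {"field": name, "size": size}}
        for name, cfg in fields.items()
        if cfg.get("bucket_agg", False)
    }


# Hypothetical field declarations: only brands_tags requests a facet.
fields = {
    "brands_tags": {"type": "keyword", "bucket_agg": True},
    "product_name": {"type": "text_lang"},
}
# "size": 0 asks Elasticsearch for aggregation results only, without hits.
query_body = {"size": 0, "aggs": build_terms_aggs(fields)}
```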

Taxonomy Name

Default: null

The name of the taxonomy associated with this field.

It must only be provided for taxonomy field type.

Split Separator

Type: string Default: ","

Separator to use when splitting values, for fields that have split=True.

Lang Separator

Type: string Default: "_"

For the text_lang field type, the separator between the name of the field and the language code, e.g. product_name_it if lang_separator="_".
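
As a small illustration, the per-language input field names for a text_lang field can be derived like this. The helper `lang_field_names` is hypothetical.

```python
def lang_field_names(
    field: str, langs: list[str], lang_separator: str = "_"
) -> list[str]:
    """Return the field name suffixed with each language code."""
    return [f"{field}{lang_separator}{lang}" for lang in langs]


names = lang_field_names("product_name", ["en", "fr", "it"])
```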

Primary Color

Type: string Default: "#aaa"

Used for Vega charts. Must be a CSS color code.

Accent Color

Type: string Default: "#222"

Used for Vega charts. Must be a CSS color code.

Type: object

Configuration of taxonomies,
that is collections of entries with synonyms in multiple languages.

See [Explain taxonomies](../explain-taxonomies)

Fields may be linked to taxonomies.

It enables enriching search with synonyms,
as well as providing suggestions,
or informative facets.

Note: if you define taxonomies, you must import them using the
[import-taxonomies command](../ref-python/cli.html#python3-m-app-import-taxonomies).

Sources

Type: array

Configurations of taxonomies that this project will use.

No Additional Items

Each item of this array must be:

Type: object

Configuration on how to fetch a particular taxonomy.

Name

Type: string

Name of the taxonomy

This is the name you will use in the configuration (and the API)
to reference this taxonomy

Url


URL of the taxonomy.

The target file must be in JSON format
and follow the Open Food Facts JSON taxonomy format.

This is a dict where each key corresponds to a taxonomy entry ID,
and each value is a dict with the following properties:

  • name: contains a dict giving the name (string) for this entry
    in various languages (keys are language codes)
  • synonyms: contains a dict giving a list of synonyms by language code
  • parents: contains a list of direct parent ids (taxonomy is a directed acyclic graph)

Other keys correspond to properties associated with this entry (e.g. Wikidata ID).

Type: string Format: uri

Must be at least 1 character long

Type: string Format: uri

Must be at least 1 character long

Must be at most 2083 characters long
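
The taxonomy JSON structure described above can be pictured with a tiny, made-up example, shown here as the Python dict you would get after parsing the file. The entries and the helper `ancestors` are illustrative only; the helper walks the `parents` lists to collect every direct and indirect parent of an entry.

```python
# Made-up taxonomy with three entries, following the described structure.
taxonomy = {
    "en:plant-based-foods": {
        "name": {"en": "Plant-based foods"},
        "synonyms": {"en": ["Plant-based foods"]},
        "parents": [],
    },
    "en:fruits": {
        "name": {"en": "Fruits", "fr": "Fruits"},
        "synonyms": {"en": ["Fruits", "Fruit"], "fr": ["Fruits"]},
        "parents": ["en:plant-based-foods"],
    },
    "en:apples": {
        "name": {"en": "Apples", "fr": "Pommes"},
        "synonyms": {"en": ["Apples", "Apple"], "fr": ["Pommes", "Pomme"]},
        "parents": ["en:fruits"],
    },
}


def ancestors(taxonomy: dict, entry_id: str) -> set[str]:
    """Collect all direct and indirect parents of an entry."""
    seen: set[str] = set()
    stack = list(taxonomy[entry_id].get("parents", []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(taxonomy.get(parent, {}).get("parents", []))
    return seen


apple_ancestors = ancestors(taxonomy, "en:apples")
```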

Type: object

This is the configuration of
the ElasticSearch index storing the taxonomies.

All taxonomies are stored within the same index.

It enables functions like auto-completion, or field suggestions
as well as enrichment of requests with synonyms.

Name

Type: string

Name of the index alias to use.

Search-a-licious will create an index using this name and an import date,
but the alias will always point to the latest index.

The alias must not already exist in your Elasticsearch instance.

Number Of Shards

Type: integer Default: 4

Number of shards to use for the index.

Shards are useful to distribute the load on your cluster.
(see index settings)

Number Of Replicas

Type: integer Default: 1

Number of replicas to use for the index.

More replicas mean more resiliency but also more disk space and memory.

(see index settings)

Preprocessor

Default: null

Type: string

The fully qualified reference to the preprocessor
to use before taxonomy entry import.

This class must inherit app.indexing.BaseTaxonomyPreprocessor
and specialize the preprocess method.

This is used to adapt the taxonomy schema
or to add specific fields for example.


Example:

app.openfoodfacts.TaxonomyPreprocessor
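
A fully qualified reference is a dotted path made of a module path followed by a class name. The sketch below shows one generic way such a reference can be resolved with the standard library; `load_class` is a hypothetical helper, a standard-library class stands in for a real preprocessor, and this is not necessarily how Search-a-licious loads the class.

```python
from importlib import import_module


def load_class(reference: str) -> type:
    """Import the module part of a dotted reference and return the class."""
    module_path, _, class_name = reference.rpartition(".")
    if not module_path:
        raise ValueError(f"{reference!r} is not a fully qualified reference")
    return getattr(import_module(module_path), class_name)


# A standard-library class stands in for a real preprocessor here.
cls = load_class("collections.OrderedDict")
```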

Supported Langs

Type: array of string

A list of all supported languages. It is used to build the index mapping.

No Additional Items

Each item of this array must be:


Example:

['en', 'fr', 'it']

Document Fetcher

Type: string

The fully qualified reference to the document fetcher,
i.e. the class responsible for fetching the document
using the document ID present in the Redis stream.

It should inherit app._import.BaseDocumentFetcher
and specialize the fetch_document method.

To keep things lightweight,
the event stream payload generally contains only a few item fields.
This class will fetch the full document using your application API.


Example:

app.openfoodfacts.DocumentFetcher

Preprocessor

Default: null

Type: string

The fully qualified reference to the preprocessor
to use before data import.

This class must inherit app.indexing.BaseDocumentPreprocessor
and specialize the preprocess method.

This is used to adapt the data schema
or to add search-a-licious specific fields
for example.


Example:

app.openfoodfacts.DocumentPreprocessor

Result Processor

Default: null

Type: string

The fully qualified reference to the Elasticsearch result processor
to use after a search query to Elasticsearch.

This class must inherit app.postprocessing.BaseResultProcessor
and specialize the process_after method.

This can be used to add custom fields computed from index content.

Example:

app.openfoodfacts.ResultProcessor

Scripts

Default: null

Type: object

You can add scripts that can be used for sorting results.

Each key is a script name, with its configuration.

Each additional property must conform to the following schema

Type: object

Scripts can be used to sort results of a search.

This uses Elasticsearch's internal capabilities.

Type: enum (of string) Default: "expression"

The script language, as supported by Elasticsearch

Must be one of:

  • "expression"
  • "painless"

Source

Type: string

The source of the script

Params


Type: object

Params for the scripts. We need this to retrieve and validate parameters

Static Params


Type: object

Additional params for the scripts that can't be supplied by the API (constants)
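
To show how the script properties fit together, here is a sketch of one script entry (as a parsed dict) and of how request parameters might be combined with static ones. The script source, the field `popularity`, the helper `resolve_params`, and the idea that `params` holds default values are all assumptions made for this example.

```python
# Hypothetical script entry: an Elasticsearch "expression" script that
# scales a numeric field. In expression scripts, params are referenced
# directly by name.
script = {
    "lang": "expression",
    "source": "doc['popularity'].value * boost + offset",
    "params": {"boost": 1.0},         # may be overridden through the API
    "static_params": {"offset": 10},  # constants, never taken from the API
}


def resolve_params(script: dict, request_params: dict) -> dict:
    """Merge request params with declared defaults and static params.

    Unknown request params are rejected; static params always win.
    """
    declared = script.get("params") or {}
    unknown = set(request_params) - set(declared)
    if unknown:
        raise ValueError(f"unknown script params: {sorted(unknown)}")
    return {**declared, **request_params, **(script.get("static_params") or {})}


resolved = resolve_params(script, {"boost": 2.5})
```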

Match Phrase Boost

Type: number Default: 2.0

How much we boost exact matches on consecutive words

That is, if you search "Dark Chocolate",
it will boost entries that have the "Dark Chocolate" phrase (in the same field).

It only applies to free text search.

This only makes sense when using
the "boost_phrase" request parameter and the "best match" order.

Note: this field accepts a float or a string,
because using a float might cause rounding problems.
The string must represent a float.
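
In Elasticsearch terms, boosting consecutive words can be expressed with a `match_phrase` clause carrying a `boost`. The sketch below builds such a clause; the helper `phrase_boost_clause` and the field name are hypothetical, the optional `slop` argument is included only to show Elasticsearch's standard way of tolerating gaps between words, and the query Search-a-licious really builds may be structured differently.

```python
from typing import Optional, Union


def phrase_boost_clause(
    field: str,
    query: str,
    boost: Union[float, str],
    slop: Optional[int] = None,
) -> dict:
    """Build a match_phrase clause that boosts exact phrase matches.

    `boost` may be a float or a string representing a float.
    """
    clause: dict = {"query": query, "boost": float(boost)}
    if slop is not None:
        # slop = number of positions words may be apart and still match
        clause["slop"] = slop
    return {"match_phrase": {field: clause}}


exact = phrase_boost_clause("product_name", "Dark Chocolate", "2.0")
loose = phrase_boost_clause("product_name", "Dark Chocolate", 2.0, slop=2)
```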

Match Phrase Boost Proximity

Default: null

How much proximity we allow for match_phrase_boost.

If unspecified, we will just match word to word.
Otherwise, some gap is allowed between matching words.

This only makes sense when using
the "boost_phrase" request parameter and the "best match" order.

Document Denylist

Type: array of string

List of document IDs to ignore.

Use this to skip some documents at indexing time.

All items must be unique

No Additional Items

Each item of this array must be:

Redis Stream Name

Default: null

Name of the Redis stream to read from when listening to document updates.

If not provided, document updates won't be listened to for this index.

Default Index

Type: string

The default index to use when no index is specified in the query.